Blackwell Electronics

Customer Analytics Report — Purchasing Behaviour & Demographics

Author

Freda Erinmwingbovo

Published

February 22, 2026

BLACKWELL ELECTRONICS CONFIDENTIAL INTERNAL REPORT

An investigation into regional spending patterns, customer demographics, and predictive modelling of purchase behaviour across 80,000 transactions.

PREPARED FOR Danielle Sherman, CTO
RECORDS ANALYSED 80,000 Transactions

1 Introduction and Data Loading

1.1 Background & Objectives

Blackwell Electronics has operated as a successful electronics retailer for over three decades. Following the launch of their eCommerce platform just over a year ago, the business has begun accumulating valuable customer transaction data, both online and in-store.

This report presents a data-driven investigation into customer purchasing behaviour, commissioned by Danielle Sherman (CTO) and informed by the business questions raised by VP of Sales, Martin Goodrich. The analysis is structured around three core questions:

  1. Regional Spending: Do customers in different regions spend more per transaction? Which regions spend the most and least?
  2. Demographics & Channel: Are older customers more likely to shop in-store than online? What factors predict how a customer shops?
  3. Predictive Modelling: Can we predict a customer’s region, their purchase channel, or their age from the available data?

The dataset contains 80,000 customer transactions, capturing age, number of items purchased, transaction amount, purchase channel (online vs. in-store), and region.

1.2 Loading the Libraries

The analysis draws on R’s tidyverse ecosystem for data manipulation and visualisation, and on the caret framework, an industry-standard toolkit for building and evaluating predictive models.

Show Code
# Data manipulation and visualisation
library(tidyverse)   # Core suite: dplyr, ggplot2, readr, tidyr etc.
library(scales)      # For formatting axes (currency, percentages)
library(knitr)       # For clean tables in the report
library(kableExtra)  # For styled, professional tables

# Machine learning
library(caret)       # Classification and regression training framework
library(randomForest) # Random Forest algorithm
library(nnet)        # Multinomial logistic regression (for region prediction)

# Report styling
library(ggthemes)    # Additional ggplot2 themes
library(patchwork)   # Arranging ggplot2 plots side by side

1.3 Importing the Data

Show Code
# Import the customer transaction dataset
df <- read_csv("Demographic_Data.csv")

# Rename the `in-store` indicator for clarity; the other columns
# already carry meaningful names
df <- df %>%
  rename(channel = `in-store`)

# Convert categorical variables to factors with meaningful labels
df <- df %>%
  mutate(
    channel = factor(channel, levels = c(0, 1), 
                     labels = c("Online", "In-Store")),
    region  = factor(region, levels = c(1, 2, 3, 4),
                     labels = c("North", "South", "East", "West"))
  )

# Preview the data
glimpse(df)
Rows: 80,000
Columns: 5
$ channel <fct> Online, Online, In-Store, In-Store, In-Store, Online, Online, …
$ age     <dbl> 37, 35, 45, 46, 33, 24, 43, 66, 51, 47, 44, 33, 32, 48, 34, 43…
$ items   <dbl> 4, 2, 3, 3, 4, 3, 6, 5, 5, 3, 5, 2, 2, 7, 6, 7, 7, 4, 2, 7, 7,…
$ amount  <dbl> 281.0300, 219.5100, 1525.7000, 715.2500, 1937.5000, 1314.2000,…
$ region  <fct> South, South, West, East, North, West, South, East, North, Nor…

2 Data Preprocessing

Before any analysis can begin, it is essential to examine the quality and structure of the data. Raw data, even from internal systems, can contain errors, inconsistencies, missing values, or extreme observations that could distort results and lead to misleading conclusions.

This section documents every step taken to inspect, clean, and validate the dataset prior to analysis. All decisions made during preprocessing are explicitly justified so that the analytical process is fully transparent and reproducible.

2.1 Data Inspection

The first step is simply to look at the data carefully: its dimensions, structure, data types, and a statistical snapshot of every variable. This gives us an immediate sense of what we are working with and flags anything that needs attention before we proceed.

Show Code
# --- Dimensions ---
cat("Rows:", nrow(df), "\n")
Rows: 80000 
Show Code
cat("Columns:", ncol(df), "\n")
Columns: 5 
Show Code
# --- First 6 rows ---
df %>% head() %>% kable() %>% kable_styling(full_width = TRUE)
channel age items amount region
Online 37 4 281.03 South
Online 35 2 219.51 South
In-Store 45 3 1525.70 West
In-Store 46 3 715.25 East
In-Store 33 4 1937.50 North
Online 24 3 1314.20 West
Show Code
# --- Clean styled summary table ---
summary_table <- data.frame(
  Variable = c("channel", "age", "items", "amount", "region"),
  Type     = c("Factor", "Numeric", "Numeric", "Numeric", "Factor"),
  Min      = c("—", "18", "1", "$5.01", "—"),
  Mean     = c("—", "45.76", "4.51", "$835.92", "—"),
  Median   = c("—", "45", "4", "$582.32", "—"),
  Max      = c("—", "85", "8", "$3,000", "—")
)

summary_table %>%
  kable(align = c("l","l","r","r","r","r")) %>%
  kable_styling(full_width = TRUE)
Variable Type Min Mean Median Max
channel Factor — — — —
age Numeric 18 45.76 45 85
items Numeric 1 4.51 4 8
amount Numeric $5.01 $835.92 $582.32 $3,000
region Factor — — — —

2.1.1 Interpretation

The dataset contains 80,000 rows and 5 columns. Each row represents a single customer transaction. A statistical summary reveals the following initial observations:

  • Age ranges from 18 to 85, with a mean of approximately 46 years.
  • Items purchased per transaction ranges from 1 to 8, averaging around 4 to 5 items.
  • Amount spent ranges from $5.01 to $3,000, a wide spread with the mean ($835.92) well above the median ($582.32), suggesting potential outliers that will need investigation.
  • Channel and Region are correctly encoded as factors with their respective labels.

No immediately obvious structural issues are visible, but we will now systematically check for missing values, duplicates, and outliers.

2.2 Missing Values

Missing data is one of the most common data quality issues in real-world datasets. If left unaddressed, missing values can cause errors in statistical functions and machine learning models, or silently produce biased results. We systematically check every column for the presence of NA values before proceeding.

Show Code
# --- Count of missing values per column ---
missing_summary <- df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Missing_Count") %>%
  mutate(
    Total_Rows      = nrow(df),
    Missing_Percent = round((Missing_Count / Total_Rows) * 100, 2)
  )

# --- Display as a clean table ---
missing_summary %>%
  kable(
    col.names = c("Variable", "Missing Count", "Total Rows", "Missing (%)"),
    caption   = "Table: Missing Value Summary"
  ) %>%
  kable_styling(full_width = FALSE)
Table: Missing Value Summary
Variable Missing Count Total Rows Missing (%)
channel 0 80000 0
age 0 80000 0
items 0 80000 0
amount 0 80000 0
region 0 80000 0

2.2.1 Interpretation

The missing value check confirms that the dataset is complete. Every column registers zero missing values across all 80,000 records. This is an important quality indicator, as it means no imputation strategy is required and we can proceed to the next preprocessing step with confidence that no records will be silently dropped during analysis.

2.3 Duplicate Records

Duplicate records occur when the same transaction is captured more than once in a dataset. This can happen due to system errors, data entry mistakes, or merging of multiple data sources. Duplicates artificially inflate sample sizes and can bias both descriptive statistics and machine learning models. We identify and remove any exact duplicate rows before proceeding.

Show Code
# --- Identify and save duplicates before removal ---
dupes <- df[duplicated(df), ]
duplicate_count <- nrow(dupes)

# --- Remove duplicates ---
df <- df %>% distinct()

# --- Print counts together ---
cat("Duplicate rows detected:", duplicate_count, "\n",
    "Rows after deduplication:", nrow(df), "\n",
    "Columns:", ncol(df))
Duplicate rows detected: 21 
 Rows after deduplication: 79979 
 Columns: 5
Show Code
# --- Preview of removed duplicates as styled table ---
dupes %>%
  head() %>%
  kable(caption = "Table: Preview of Duplicate Records Removed") %>%
  kable_styling(full_width = FALSE)
Table: Preview of Duplicate Records Removed
channel age items amount region
Online 76 6 283.61 South
Online 67 4 320.41 South
In-Store 23 2 1019.50 West
In-Store 38 3 1993.80 North
In-Store 31 5 1488.10 West
Online 39 5 1268.00 West

2.3.1 Interpretation

The duplicate check identified 21 exact duplicate records within the dataset. These rows were identical across all five variables, suggesting they are most likely the result of a data entry or system logging error rather than genuine repeat transactions: it is statistically improbable that two separate transactions would independently match on age, number of items, amount (to the cent), region, and channel.

All 21 duplicate rows were removed using distinct(), reducing the dataset from 80,000 to 79,979 records. This is a negligible reduction of just 0.026% and will have no meaningful impact on the analysis, but it ensures the integrity of every subsequent step.

2.4 Data Types Validation

Every variable in a dataset must be stored as the correct data type. A variable stored in the wrong type can cause silent errors. For example, a categorical variable stored as a numeric will be treated as a continuous measurement by statistical functions and machine learning algorithms, producing nonsensical results. We verify that each variable is correctly typed before proceeding.

Show Code
# --- Check data types of all columns ---
data_types <- df %>%
  summarise(across(everything(), class)) %>%
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Data_Type") %>%
  mutate(
    Expected_Type = c("Factor", "Numeric", "Numeric", "Numeric", "Factor"),
    Status = ifelse(Data_Type == c("factor", "numeric", 
                                   "numeric", "numeric", "factor"),
                    "✔ Correct", "✘ Incorrect")
  )

# --- Display as clean table ---
data_types %>%
  kable(
    col.names = c("Variable", "Current Type", "Expected Type", "Status"),
    caption   = "Table: Data Types Validation"
  ) %>%
  kable_styling(full_width = FALSE)
Table: Data Types Validation
Variable Current Type Expected Type Status
channel factor Factor ✔ Correct
age numeric Numeric ✔ Correct
items numeric Numeric ✔ Correct
amount numeric Numeric ✔ Correct
region factor Factor ✔ Correct

2.4.1 Interpretation

All five variables are correctly typed:

  • Channel and Region are stored as factors, meaning R treats them as discrete categories rather than continuous numbers. This is critical: without it, a region coded 1 to 4 would be interpreted as a measurement in which West (4) is somehow “greater than” North (1), which is meaningless.

  • Age, Items, and Amount are stored as numeric, which is appropriate as they represent measurable quantities that can be averaged, summed, and used in regression models.
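The distinction can be demonstrated with a toy vector (illustrative values only, not the Blackwell data): R will happily average raw numeric region codes, but refuses to average a factor.

```r
# Toy illustration: averaging raw region codes yields a number with
# no meaning, while a factor blocks the operation outright.
region_numeric <- c(1, 4, 2, 4)
mean(region_numeric)  # 2.75 -- a "mean region" that means nothing

region_factor <- factor(region_numeric, levels = 1:4,
                        labels = c("North", "South", "East", "West"))
mean(region_factor)   # warns and returns NA: R refuses the nonsense
table(region_factor)  # counts per category -- the meaningful summary
```
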

No data type corrections are required. We proceed to outlier detection.

2.5 Outlier Detection

Outliers are observations that fall unusually far from the rest of the data. They can arise from genuine extreme behaviour, such as a customer making a very large purchase, or from data entry errors. Either way, they require careful investigation: they can heavily distort statistical measures such as the mean, and they can degrade the performance of machine learning models.

We use two complementary approaches to detect outliers:

  1. Visual inspection: boxplots to see the distribution of each numeric variable and immediately spot extreme values.
  2. The IQR method: a statistical rule that flags any observation falling below Q1 − 1.5×IQR or above Q3 + 1.5×IQR as a potential outlier. This is a widely used standard for outlier detection.

We examine the three numeric variables: age, items, and amount.

Show Code
# --- Define a consistent colour palette ---
palette <- c(
  "age"    = "#00d4ff",
  "items"  = "#7b61ff",
  "amount" = "#f0c040"
)

# --- Reshape data for faceted boxplots ---
df_long <- df %>%
  select(age, items, amount) %>%
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Value")

# --- Boxplots ---
ggplot(df_long, aes(x = Variable, y = Value, fill = Variable)) +
  geom_boxplot(outlier.colour = "#ff4d4d",
               outlier.alpha  = 0.3,
               outlier.size   = 0.8,
               width          = 0.5) +
  facet_wrap(~ Variable, scales = "free", nrow = 1) +
  scale_fill_manual(values = palette) +
  labs(
    title    = "Figure: Boxplot Distribution of Numeric Variables",
    subtitle = "Red dots indicate potential outliers flagged by the IQR method",
    x        = NULL,
    y        = "Value"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.background    = element_rect(fill = "#0a0a0f", color = NA),
    panel.background   = element_rect(fill = "#12121a", color = NA),
    panel.grid.major   = element_line(color = "#2a2a3a"),
    panel.grid.minor   = element_blank(),
    strip.background   = element_rect(fill = "#1a1a2e", color = NA),
    strip.text         = element_text(color = "#00d4ff", 
                                      face  = "bold", 
                                      size  = 10),
    plot.title         = element_text(color = "#ffffff", 
                                      face  = "bold", 
                                      size  = 13),
    plot.subtitle      = element_text(color = "#8a8a9a", size = 10),
    axis.text          = element_text(color = "#8a8a9a"),
    axis.title         = element_text(color = "#8a8a9a"),
    legend.position    = "none"
  )

Show Code
# --- IQR Method: Count outliers per variable ---
count_outliers <- function(x) {
  Q1  <- quantile(x, 0.25)
  Q3  <- quantile(x, 0.75)
  IQR <- Q3 - Q1
  sum(x < (Q1 - 1.5 * IQR) | x > (Q3 + 1.5 * IQR))
}

outlier_summary <- df %>%
  summarise(
    Age    = count_outliers(age),
    Items  = count_outliers(items),
    Amount = count_outliers(amount)
  ) %>%
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Outlier_Count") %>%
  mutate(
    Total_Rows      = nrow(df),
    Outlier_Percent = round((Outlier_Count / Total_Rows) * 100, 2)
  )

# --- Display outlier summary table ---
outlier_summary %>%
  kable(
    col.names = c("Variable", "Outlier Count", 
                  "Total Rows", "Outliers (%)"),
    caption   = "Table: Outlier Count by Variable (IQR Method)"
  ) %>%
  kable_styling(full_width = FALSE)
Table: Outlier Count by Variable (IQR Method)
Variable Outlier Count Total Rows Outliers (%)
Age 0 79979 0.0
Items 0 79979 0.0
Amount 2320 79979 2.9

2.5.1 Interpretation

The boxplots and IQR analysis reveal a clear and important finding:

  • Age: Zero outliers detected. The age range of 18 to 85 years is entirely plausible for a retail electronics customer base. The distribution is clean and requires no intervention.

  • Items: Zero outliers detected. With a natural ceiling of 8 items per transaction, this variable is tightly bounded and consistent throughout the dataset.

  • Amount: 2,320 outliers detected, representing 2.9% of all transactions. The boxplot clearly shows a right-skewed distribution, with a cluster of transactions at the higher end of the spending range pulling the mean ($835.92) significantly above the median ($582.32). These are transactions where customers spent considerably more than the typical customer.

The critical question is: are these outliers errors, or are they real? In an electronics retail context, high-value transactions are entirely plausible. A customer purchasing a high-end television, laptop, or multiple premium devices in a single transaction could legitimately spend at the upper end of the range. The minimum of around $5 and the maximum of $3,000 are both realistic for an electronics retailer.

These outliers are therefore likely genuine extreme values rather than data entry errors, and the decision of how to treat them will be addressed in the next step.

2.6 Outlier Treatment

Having identified 2,320 outliers in the amount variable, we must now decide how to handle them. There are three common approaches:

  1. Remove them: delete all flagged rows from the dataset. This is appropriate when outliers are confirmed errors. However, removing 2.9% of genuine transactions would mean discarding real customer behaviour and potentially biasing our analysis against high-value customers.

  2. Cap them (Winsorisation): replace values beyond the IQR boundaries with the boundary value itself. This retains all records but reduces the influence of extreme values.

  3. Retain them: keep the outliers as-is, on the basis that they represent genuine behaviour. This is appropriate when the data generating process naturally produces extreme values.
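For completeness, option 2 can be sketched in a few lines. This is illustrative only and is not applied to the Blackwell data, since the retain strategy adopted below preserves the genuine high-value transactions.

```r
# Illustrative Winsorisation sketch (NOT applied in this report):
# clamp any value beyond the IQR fences to the fence value itself.
cap_outliers <- function(x) {
  Q1    <- quantile(x, 0.25)
  Q3    <- quantile(x, 0.75)
  fence <- 1.5 * (Q3 - Q1)
  pmin(pmax(x, Q1 - fence), Q3 + fence)
}

# Toy example: the extreme value 100 is pulled back to the upper fence (8.5)
cap_outliers(c(1, 2, 3, 4, 5, 100))
```

Applied to amount, this would retain all 79,979 rows while limiting the leverage of the 2,320 high-value transactions, at the cost of distorting their true values.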

Given the business context (an electronics retailer where high-value purchases are entirely expected), a retain-and-document strategy is adopted. The outliers in amount reflect real purchasing behaviour from high-spending customers, which is precisely the kind of insight Blackwell’s marketing team needs to understand. Removing or capping these values would obscure an important segment of the customer base.

However, to ensure the statistical models are not unduly distorted, a log transformation is applied to amount where appropriate during the modelling phase. For now, the outliers are simply documented and the analysis proceeds with the full dataset intact.

Show Code
# --- Define IQR boundaries for amount ---
Q1  <- quantile(df$amount, 0.25)
Q3  <- quantile(df$amount, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

cat("Amount IQR Boundaries:\n")
Amount IQR Boundaries:
Show Code
cat("  Lower Bound: $", round(lower_bound, 2), "\n")
  Lower Bound: $ -1137.3 
Show Code
cat("  Upper Bound: $", round(upper_bound, 2), "\n")
  Upper Bound: $ 2655.82 
Show Code
cat("  Outliers below lower bound:", sum(df$amount < lower_bound), "\n")
  Outliers below lower bound: 0 
Show Code
cat("  Outliers above upper bound:", sum(df$amount > upper_bound), "\n\n")
  Outliers above upper bound: 2320 
Show Code
# --- Decision: Retain outliers ---
# We retain all records. Log-transformed amount created for 
# modelling use later.
df <- df %>%
  mutate(amount_log = log1p(amount))

cat("Log-transformed amount column added.\n")
Log-transformed amount column added.
Show Code
cat("Final dataset dimensions:\n")
Final dataset dimensions:
Show Code
cat("  Rows:", nrow(df), "| Columns:", ncol(df), "\n")
  Rows: 79979 | Columns: 6 
Show Code
# --- Visual: Original vs Log-transformed amount ---
p1 <- ggplot(df, aes(x = amount)) +
  geom_histogram(fill = "#f0c040", color = NA, bins = 60, alpha = 0.85) +
  labs(title = "Original Amount", x = "Amount ($)", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff", face = "bold"),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a")
  )

p2 <- ggplot(df, aes(x = amount_log)) +
  geom_histogram(fill = "#00d4ff", color = NA, bins = 60, alpha = 0.85) +
  labs(title = "Log-Transformed Amount", x = "Log(Amount + 1)", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff", face = "bold"),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a")
  )

# --- Display side by side ---
library(patchwork)
p1 + p2 +
  plot_annotation(
    title   = "Figure: Amount Distribution (Original vs Log-Transformed)",
    subtitle = "Log transformation compresses the right skew for modelling use",
    theme   = theme(
      plot.background = element_rect(fill = "#0a0a0f", color = NA),
      plot.title      = element_text(color = "#ffffff", 
                                     face = "bold", size = 13),
      plot.subtitle   = element_text(color = "#8a8a9a", size = 10)
    )
  )

2.6.1 Interpretation

The IQR analysis calculates a lower bound of -$1,137.30 and an upper bound of $2,655.82 for the amount variable.

The negative lower bound is produced mechanically by the IQR formula but is practically meaningless in this context: a transaction amount cannot be negative. It simply reflects how heavily right-skewed the data is, which pushes the lower boundary into impossible territory. Unsurprisingly, zero outliers were detected below the lower bound.

All 2,320 outliers fall above the upper bound of $2,655.82. These are high-value transactions where customers spent between $2,655.82 and $3,000.00 in a single purchase. In the context of an electronics retailer, this is entirely plausible. A customer purchasing a premium laptop, large-screen television, or multiple devices in one transaction could easily reach this spending level.

The original distribution histogram clearly confirms the right skew: the majority of transactions cluster at lower amounts, with a long tail stretching toward $3,000.

The log-transformed distribution, however, does not produce the smooth bell curve typically expected. Instead it reveals a multimodal pattern (multiple peaks across the distribution), suggesting that customer spending naturally clusters around distinct price points. This is a meaningful business insight in itself: Blackwell’s customers do not spend randomly across a continuous range, but appear to gravitate toward particular spending brackets, possibly corresponding to product categories such as accessories, mid-range electronics, and premium devices.

While the log transformation reduces the right skew, the multimodal nature of the data means it will not perfectly satisfy the normality assumptions of some statistical models. This will be kept in mind during the modelling phase.
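One quick way to probe the apparent spending brackets is to bin amount into coarse ranges and tabulate the counts. The breakpoints below are hypothetical illustrations only (not boundaries derived from the histogram peaks), and the sketch runs on a toy vector; in the report it would be applied to df$amount.

```r
# Hypothetical spending brackets (assumed breakpoints, for illustration).
# Real boundaries would come from inspecting the histogram peaks or
# fitting a mixture model.
spend_bracket <- function(amount) {
  cut(amount,
      breaks = c(0, 500, 1500, 3000),
      labels = c("Accessories", "Mid-range", "Premium"))
}

# Toy data standing in for df$amount
table(spend_bracket(c(45, 220, 480, 760, 1400, 2100, 2900)))
```
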

All 79,979 records are retained. The dataset is now fully preprocessed and ready for exploratory data analysis.

2.7 Preprocessing Summary

This section summarises every step taken during the preprocessing phase. A clean, documented audit trail of data preparation decisions is a hallmark of professional analytical work. It ensures transparency, reproducibility, and gives any reader full confidence in the integrity of the dataset before conclusions are drawn.

Show Code
# --- Build the preprocessing audit table ---
preprocessing_log <- tibble(
  Step = c(
    "1. Data Inspection",
    "2. Missing Values",
    "3. Duplicate Records",
    "4. Data Types Validation",
    "5. Outlier Detection",
    "6. Outlier Treatment"
  ),
  Finding = c(
    "80,000 rows, 5 columns. All variables present and accounted for.",
    "Zero missing values detected across all 5 columns.",
    "21 exact duplicate rows identified.",
    "All variables correctly typed; channel & region as factors, age, items & amount as numeric.",
    "2,320 outliers detected in amount (2.9%). Zero outliers in age and items.",
    "Outliers retained — consistent with electronics retail context. Log transformation applied to amount."
  ),
  Action_Taken = c(
    "No action required.",
    "No action required.",
    "21 duplicate rows removed. Dataset reduced to 79,979 records.",
    "No action required.",
    "No action required. Outliers confirmed as genuine high-value transactions.",
    "amount_log column added. All 79,979 records retained."
  ),
  Status = c(
    "✔ Complete",
    "✔ Complete",
    "✔ Complete",
    "✔ Complete",
    "✔ Complete",
    "✔ Complete"
  )
)

# --- Display as styled table ---
preprocessing_log %>%
  kable(
    col.names = c("Step", "Finding", "Action Taken", "Status"),
    caption   = "Table: Data Preprocessing Audit Log"
  ) %>%
  kable_styling(full_width = TRUE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff") %>%
  column_spec(4, bold = TRUE, color = "#00d4ff")
Table: Data Preprocessing Audit Log
Step Finding Action Taken Status
1. Data Inspection 80,000 rows, 5 columns. All variables present and accounted for. No action required. ✔ Complete
2. Missing Values Zero missing values detected across all 5 columns. No action required. ✔ Complete
3. Duplicate Records 21 exact duplicate rows identified. 21 duplicate rows removed. Dataset reduced to 79,979 records. ✔ Complete
4. Data Types Validation All variables correctly typed; channel & region as factors, age, items & amount as numeric. No action required. ✔ Complete
5. Outlier Detection 2,320 outliers detected in amount (2.9%). Zero outliers in age and items. No action required. Outliers confirmed as genuine high-value transactions. ✔ Complete
6. Outlier Treatment Outliers retained — consistent with electronics retail context. Log transformation applied to amount. amount_log column added. All 79,979 records retained. ✔ Complete

2.8 Clean Dataset, Ready for Analysis

The preprocessing phase is now complete. The following summarises the final state of the dataset entering the analysis phase:

Attribute Detail
Total Records 79,979 transactions
Total Variables 6 (including amount_log)
Missing Values None
Duplicates Removed (21 rows)
Outliers Retained & documented
Data Types All correctly assigned

The dataset is clean, validated, and fully documented. We now proceed to Exploratory Data Analysis, where we begin to answer the business questions posed by Danielle Sherman and Martin Goodrich.

3 Exploratory Data Analysis

With a clean, validated dataset in hand, this analysis now turns to Exploratory Data Analysis (EDA), the process of visually and statistically examining the data to uncover patterns, relationships, and insights that directly address the business questions posed by Danielle Sherman and Martin Goodrich.

EDA is not simply about producing charts. It is about telling a story with data: understanding who Blackwell’s customers are, how they behave, and what factors drive their purchasing decisions. Every visualisation in this section is intentional and directly linked to a specific business question from the brief.

This EDA is structured around five key themes:

  1. Regional Spending Patterns: Which regions spend the most and least per transaction? This directly addresses Danielle’s first question about whether geography influences spending behaviour.

  2. Items vs Amount: Is there a relationship between the number of items purchased and the amount spent? A pattern here could inform decisions around bundling, promotions, and upselling.

  3. Customer Age & Spending: How does customer age relate to transaction amount? This directly investigates Martin Goodrich’s hypothesis that older customers spend more on electronics than younger customers.

  4. Purchase Channel Analysis: How do online and in-store customers differ in age and spending behaviour? This addresses Martin’s second hypothesis that in-store customers tend to be older than online customers, a claim with significant implications for website design and digital marketing strategy.

  5. Age Differences Across Regions: Are there meaningful differences in the age profile of customers across the four regions? This addresses Danielle’s question about whether region can be predicted from demographic data, and lays the foundation for the predictive modelling section that follows.

Together, these five analyses form a comprehensive picture of Blackwell’s customer base, one that will directly inform the marketing, website, and business strategy decisions under consideration by the leadership team.

3.1 Regional Spending Patterns

The first business question posed by Danielle is whether customers in different regions spend different amounts per transaction, and which regions spend the most and least. Understanding regional spending is critical for directing marketing budgets. It makes little sense to invest equally across all regions if some consistently generate higher transaction values than others.

Regional spending is examined here using three complementary views: a summary statistics table, a bar chart of mean spending by region, and a boxplot showing the full distribution of spending per region.

Show Code
# --- Summary statistics by region ---
regional_summary <- df %>%
  group_by(region) %>%
  summarise(
    Transactions = n(),
    Mean_Amount  = round(mean(amount), 2),
    Median_Amount = round(median(amount), 2),
    SD_Amount    = round(sd(amount), 2),
    Min_Amount   = round(min(amount), 2),
    Max_Amount   = round(max(amount), 2)
  ) %>%
  arrange(desc(Mean_Amount))

# --- Display summary table ---
regional_summary %>%
  kable(
    col.names = c("Region", "Transactions", "Mean ($)", 
                  "Median ($)", "Std Dev ($)", 
                  "Min ($)", "Max ($)"),
    caption   = "Table: Regional Spending Summary Statistics"
  ) %>%
  kable_styling(full_width = TRUE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff") %>%
  row_spec(1, background = "rgba(0, 212, 255, 0.08)")
Table: Regional Spending Summary Statistics
Region Transactions Mean ($) Median ($) Std Dev ($) Min ($) Max ($)
West 25988 1283.94 1230.25 756.30 50.60 3000.00
East 18000 917.97 697.82 751.41 50.13 3000.00
North 15997 744.99 662.86 492.99 50.05 1999.80
South 19994 252.10 250.20 142.68 5.00 499.94
Show Code
# --- Region colour palette ---
region_palette <- c(
  "North" = "#00d4ff",
  "South" = "#7b61ff",
  "East"  = "#f0c040",
  "West"  = "#ff6b6b"
)

# --- Bar chart: Mean spending by region ---
p1 <- regional_summary %>%
  ggplot(aes(x = reorder(region, Mean_Amount), 
             y = Mean_Amount, 
             fill = region)) +
  geom_col(width = 0.6, alpha = 0.9) +
  geom_text(aes(label = paste0("$", format(Mean_Amount, big.mark = ","))),
            hjust   = -0.1,
            color   = "#e8e8e8",
            size    = 3.8,
            fontface = "bold") +
  scale_fill_manual(values = region_palette) +
  scale_y_continuous(
    labels = dollar_format(),
    expand = expansion(mult = c(0, 0.2))
  ) +
  coord_flip() +
  labs(
    title    = "Mean Transaction Amount by Region",
    subtitle = "West region leads in average spend per transaction",
    x        = NULL,
    y        = "Mean Amount ($)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff", 
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a"),
    legend.position  = "none"
  )

# --- Boxplot: Full distribution by region ---
p2 <- df %>%
  ggplot(aes(x = reorder(region, amount, FUN = mean),
             y = amount,
             fill = region)) +
  geom_boxplot(outlier.colour = "#ff4d4d",
               outlier.alpha  = 0.2,
               outlier.size   = 0.6,
               width          = 0.5,
               alpha          = 0.85) +
  scale_fill_manual(values = region_palette) +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip() +
  labs(
    title    = "Spending Distribution by Region",
    subtitle = "Red dots indicate high-value transactions",
    x        = NULL,
    y        = "Transaction Amount ($)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff", 
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a"),
    legend.position  = "none"
  )

# --- Display side by side ---
p1 + p2 +
  plot_annotation(
    title   = "Figure: Regional Spending Analysis",
    theme   = theme(
      plot.background = element_rect(fill = "#0a0a0f", color = NA),
      plot.title      = element_text(color = "#ffffff", 
                                     face  = "bold", size = 13)
    )
  )

3.1.1 Interpretation

Regional spending varies considerably across Blackwell’s customer base. The West leads with a mean transaction value of $1,283.94, followed by East ($917.97), North ($744.99), and South ($252.10).

Most notably, the South region’s maximum transaction value is just $499.94, a ceiling no other region comes close to. This suggests Southern customers purchase exclusively from the lower price range, although such a sharp cut-off could also reflect regional product availability or a data collection artefact, and is worth verifying before acting on it.

Recommendation: Premium campaigns should target the West and East. The South requires a value-oriented strategy, and the complete absence of high-value purchases in that region warrants further investigation.

3.2 Items vs Amount Relationship

A key question for Blackwell’s sales strategy is whether customers who purchase more items in a single transaction also spend more money. If a strong positive relationship exists, it would support strategies such as bundling, cross-selling, and volume discounts to drive revenue. The relationship between items purchased and transaction amount is examined here using a scatter plot and a correlation analysis.

Show Code
# --- Pearson Correlation ---
correlation <- cor(df$items, df$amount, method = "pearson")
cat("Pearson Correlation (items vs amount):", round(correlation, 4), "\n")
Pearson Correlation (items vs amount): 4e-04 
Show Code
# --- Scatter plot with trend line ---
p1 <- df %>%
  ggplot(aes(x = items, y = amount)) +
  geom_jitter(alpha = 0.08, color = "#00d4ff", size = 0.6, width = 0.3) +
  geom_smooth(method = "lm", color = "#f0c040", 
              se = TRUE, linewidth = 1.2) +
  scale_y_continuous(labels = dollar_format()) +
  scale_x_continuous(breaks = 1:8) +
  labs(
    title    = "Items Purchased vs Transaction Amount",
    subtitle = paste0("Pearson r = ", round(correlation, 4),
                      " — negligible linear relationship"),
    x        = "Number of Items Purchased",
    y        = "Transaction Amount ($)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff", 
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a")
  )

# --- Boxplot: Amount distribution per item count ---
p2 <- df %>%
  ggplot(aes(x = factor(items), y = amount, fill = factor(items))) +
  geom_boxplot(outlier.colour = "#ff4d4d",
               outlier.alpha  = 0.2,
               outlier.size   = 0.6,
               width          = 0.6,
               alpha          = 0.85) +
  scale_fill_manual(values = c(
    "1" = "#00d4ff", "2" = "#7b61ff", "3" = "#f0c040",
    "4" = "#ff6b6b", "5" = "#00ff9d", "6" = "#ff9d00",
    "7" = "#ff61d8", "8" = "#61ffed"
  )) +
  scale_y_continuous(labels = dollar_format()) +
  labs(
    title    = "Amount Distribution by Items Purchased",
    subtitle = "Spending spread is consistent regardless of item count",
    x        = "Number of Items Purchased",
    y        = "Transaction Amount ($)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff", 
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a"),
    legend.position  = "none"
  )

# --- Display side by side ---
p1 + p2 +
  plot_annotation(
    title = "Figure: Relationship Between Items Purchased & Amount Spent",
    theme = theme(
      plot.background = element_rect(fill = "#0a0a0f", color = NA),
      plot.title      = element_text(color = "#ffffff",
                                     face  = "bold", size = 13)
    )
  )

3.2.1 Interpretation

The Pearson correlation between items purchased and amount spent is virtually zero (r = 0.0004), confirming that no meaningful linear relationship exists between the two variables. The scatter plot and boxplots reinforce this: spending is widely and randomly distributed regardless of how many items a customer buys.

Recommendation: Bundling and volume-based pricing strategies are unlikely to drive higher revenue. Spending appears to be driven by product type and price point rather than quantity purchased.
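
The correlation reported above can be reproduced from its definition: covariance divided by the product of the standard deviations. The sketch below is a minimal base-R illustration on simulated stand-in data (the names `items` and `amount` mirror the report's columns, but the values here are synthetic), not a re-run of the report's analysis:

```r
# Pearson r from its definition: cov(x, y) / (sd(x) * sd(y)).
# Simulated stand-ins for df$items and df$amount (hypothetical data).
set.seed(42)
items  <- sample(1:8, 1000, replace = TRUE)
amount <- runif(1000, min = 50, max = 3000)   # generated independently of items

r_manual  <- cov(items, amount) / (sd(items) * sd(amount))
r_builtin <- cor(items, amount, method = "pearson")

stopifnot(isTRUE(all.equal(r_manual, r_builtin)))
```

Because `amount` is generated independently of `items`, the coefficient lands near zero, mirroring the negligible r = 0.0004 observed in the real data.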

3.3 Customer Age & Spending

Martin Goodrich hypothesises that older customers spend more on electronics than younger customers. This is a critical claim. If true, it would justify design changes to the website and targeted marketing campaigns aimed at older demographics. If false, such investments could be misdirected and costly.

This hypothesis is tested directly using a scatter plot of age against transaction amount, supported by a correlation analysis and a mean spending summary across age groups.

Show Code
# --- Pearson Correlation: Age vs Amount ---
age_corr <- cor(df$age, df$amount, method = "pearson")
cat("Pearson Correlation (age vs amount):", round(age_corr, 4), "\n")
Pearson Correlation (age vs amount): -0.282 
Show Code
# --- Create three age groups ---
df <- df %>%
  mutate(age_group = cut(age,
    breaks = c(17, 39, 59, 85),
    labels = c("Young Adults (18-39)", 
               "Middle-Aged (40-59)", 
               "Senior (60+)")
  ))

# --- Mean spending by age group ---
age_group_summary <- df %>%
  group_by(age_group) %>%
  summarise(
    Transactions  = n(),
    Mean_Amount   = round(mean(amount), 2),
    Median_Amount = round(median(amount), 2)
  )

# --- Display summary table ---
age_group_summary %>%
  kable(
    col.names = c("Age Group", "Transactions",
                  "Mean ($)", "Median ($)"),
    caption   = "Table: Mean Spending by Age Group"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff")
Table: Mean Spending by Age Group
Age Group Transactions Mean ($) Median ($)
Young Adults (18-39) 31194 991.08 808.51
Middle-Aged (40-59) 32535 869.77 633.12
Senior (60+) 16250 469.83 345.62
Show Code
# --- Scatter plot: Age vs Amount ---
p1 <- df %>%
  ggplot(aes(x = age, y = amount, color = age_group)) +
  geom_point(alpha = 0.06, size = 0.7) +
  geom_smooth(method = "lm", color = "#f0c040",
              se = TRUE, linewidth = 1.2) +
  scale_color_manual(values = c(
    "Young Adults (18-39)" = "#00d4ff",
    "Middle-Aged (40-59)"  = "#7b61ff",
    "Senior (60+)"         = "#ff6b6b"
  )) +
  scale_y_continuous(labels = dollar_format()) +
  labs(
    title    = "Customer Age vs Transaction Amount",
    subtitle = paste0("Pearson r = ", round(age_corr, 4),
                      " — weak negative relationship"),
    x        = "Customer Age",
    y        = "Transaction Amount ($)",
    color    = "Age Group"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff",
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a"),
    legend.title     = element_text(color = "#8a8a9a", size = 9),
    legend.text      = element_text(color = "#8a8a9a", size = 8),
    legend.background = element_rect(fill = "#12121a", color = NA)
  )

# --- Bar chart: Mean spending by age group ---
p2 <- age_group_summary %>%
  ggplot(aes(x = age_group, y = Mean_Amount, fill = age_group)) +
  geom_col(width = 0.5, alpha = 0.9) +
  geom_text(aes(label = paste0("$", format(Mean_Amount, big.mark = ","))),
            vjust    = -0.5,
            color    = "#e8e8e8",
            size     = 3.8,
            fontface = "bold") +
  scale_fill_manual(values = c(
    "Young Adults (18-39)" = "#00d4ff",
    "Middle-Aged (40-59)"  = "#7b61ff",
    "Senior (60+)"         = "#ff6b6b"
  )) +
  scale_y_continuous(
    labels = dollar_format(),
    expand = expansion(mult = c(0, 0.15))
  ) +
  labs(
    title    = "Mean Spending by Age Group",
    subtitle = "Mean spend declines steadily as age increases",
    x        = NULL,
    y        = "Mean Amount ($)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff",
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a"),
    legend.position  = "none"
  )

# --- Display side by side ---
p1 + p2 +
  plot_annotation(
    title = "Figure: Customer Age & Spending Behaviour",
    theme = theme(
      plot.background = element_rect(fill = "#0a0a0f", color = NA),
      plot.title      = element_text(color = "#ffffff",
                                     face  = "bold", size = 13)
    )
  )

3.3.1 Interpretation: Martin’s Hypothesis #2

The data directly contradicts Martin’s hypothesis. Rather than older customers spending more, the analysis reveals the opposite: a negative correlation (r = -0.282) between age and transaction amount, meaning older customers actually tend to spend less than younger customers.

The age group summary confirms this clearly:

Age Group Mean Spend
Young Adults (18-39) $991.08
Middle-Aged (40-59) $869.77
Senior (60+) $469.83

Spending declines consistently as age increases — Young Adults spend on average more than twice as much as Senior customers ($991.08 vs $469.83).

Recommendation: Martin’s proposed website design changes and marketing activities targeting older customers as higher spenders are not supported by the data. If anything, marketing efforts aimed at driving higher transaction values should focus on younger demographics, particularly the 18-39 age group, which consistently generates the highest average spend.

3.4 Purchase Channel Analysis

Martin Goodrich’s second hypothesis is that customers who shop in-store are older than customers who shop online. This claim has significant implications. If true, it would suggest the website should be redesigned to appeal to younger shoppers, while in-store experiences are tailored to older customers.

This hypothesis is tested directly by comparing the age distribution of online and in-store customers, supported by a statistical test to determine whether any observed difference is meaningful or simply due to chance.

Show Code
# --- Summary statistics by channel ---
channel_summary <- df %>%
  group_by(channel) %>%
  summarise(
    Transactions  = n(),
    Mean_Age      = round(mean(age), 1),
    Median_Age    = round(median(age), 1),
    Mean_Amount   = round(mean(amount), 2),
    Median_Amount = round(median(amount), 2)
  )

# --- Display summary table ---
channel_summary %>%
  kable(
    col.names = c("Channel", "Transactions", "Mean Age",
                  "Median Age", "Mean Amount ($)", 
                  "Median Amount ($)"),
    caption   = "Table: Customer Profile by Purchase Channel"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff")
Table: Customer Profile by Purchase Channel
Channel Transactions Mean Age Median Age Mean Amount ($) Median Amount ($)
Online 39989 48.6 48 897.55 441.89
In-Store 39990 43.0 42 774.11 688.18
Show Code
# --- T-test: Is the age difference statistically significant? ---
online_ages   <- df %>% filter(channel == "Online") %>% pull(age)
instore_ages  <- df %>% filter(channel == "In-Store") %>% pull(age)

ttest_result <- t.test(instore_ages, online_ages)

cat("\nWelch Two-Sample T-Test: Age by Channel\n")

Welch Two-Sample T-Test: Age by Channel
Show Code
cat("  In-Store Mean Age:", round(mean(instore_ages), 2), "\n")
  In-Store Mean Age: 42.96 
Show Code
cat("  Online Mean Age:  ", round(mean(online_ages), 2), "\n")
  Online Mean Age:   48.56 
Show Code
cat("  t-statistic:      ", round(ttest_result$statistic, 4), "\n")
  t-statistic:       -51.209 
Show Code
cat("  p-value:          ", round(ttest_result$p.value, 6), "\n")
  p-value:           0 
Show Code
cat("  Conclusion:       ",
    ifelse(ttest_result$p.value < 0.05,
           "Statistically significant difference (p < 0.05)",
           "No statistically significant difference (p >= 0.05)"), "\n")
  Conclusion:        Statistically significant difference (p < 0.05) 
Show Code
# --- Channel colour palette ---
channel_palette <- c("Online" = "#00d4ff", "In-Store" = "#ff6b6b")

# --- Density plot: Age distribution by channel ---
p1 <- df %>%
  ggplot(aes(x = age, fill = channel, color = channel)) +
  geom_density(alpha = 0.3, linewidth = 1) +
  scale_fill_manual(values  = channel_palette) +
  scale_color_manual(values = channel_palette) +
  labs(
    title    = "Age Distribution by Purchase Channel",
    subtitle = "Online vs In-Store customer age profiles",
    x        = "Customer Age",
    y        = "Density",
    fill     = "Channel",
    color    = "Channel"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background   = element_rect(fill = "#0a0a0f", color = NA),
    panel.background  = element_rect(fill = "#12121a", color = NA),
    panel.grid.major  = element_line(color = "#2a2a3a"),
    panel.grid.minor  = element_blank(),
    plot.title        = element_text(color = "#ffffff",
                                     face  = "bold", size = 12),
    plot.subtitle     = element_text(color = "#8a8a9a", size = 10),
    axis.text         = element_text(color = "#8a8a9a"),
    axis.title        = element_text(color = "#8a8a9a"),
    legend.background = element_rect(fill = "#12121a", color = NA),
    legend.title      = element_text(color = "#8a8a9a", size = 9),
    legend.text       = element_text(color = "#8a8a9a", size = 8)
  )

# --- Boxplot: Age by channel ---
p2 <- df %>%
  ggplot(aes(x = channel, y = age, fill = channel)) +
  geom_boxplot(outlier.colour = "#ff4d4d",
               outlier.alpha  = 0.3,
               outlier.size   = 0.8,
               width          = 0.5,
               alpha          = 0.85) +
  geom_jitter(alpha = 0.02, width = 0.2,
              color = "#ffffff", size = 0.4) +
  scale_fill_manual(values = channel_palette) +
  labs(
    title    = "Age Spread by Purchase Channel",
    subtitle = "Median age comparison between channels",
    x        = NULL,
    y        = "Customer Age"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff",
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a"),
    legend.position  = "none"
  )

# --- Display side by side ---
p1 + p2 +
  plot_annotation(
    title = "Figure: Age Analysis by Purchase Channel",
    theme = theme(
      plot.background = element_rect(fill = "#0a0a0f", color = NA),
      plot.title      = element_text(color = "#ffffff",
                                     face  = "bold", size = 13)
    )
  )

3.4.1 Interpretation: Martin’s Hypothesis #1

The analysis reveals a statistically significant age difference between the two channels, but in the opposite direction to what Martin expected.

Online customers are actually older (mean age 48.56) than in-store customers (mean age 42.96), contradicting Martin’s claim that in-store shoppers are older. The Welch t-test confirms this difference is statistically significant (t = -51.21, p < 0.0001), meaning it is extremely unlikely to be due to chance across 79,979 transactions.

Additionally, the typical (median) in-store transaction is considerably larger ($688.18 vs $441.89) despite in-store customers being younger, which again runs counter to Martin’s assumption that older customers are the higher spenders. Note that online transactions show a higher mean ($897.55 vs $774.11) but a much lower median, indicating the online average is inflated by a minority of very large purchases.

Recommendation: The evidence does not support redesigning the website to attract older buyers: older customers are already the primary online shoppers. Instead, attention should focus on understanding why younger in-store customers generate higher typical transaction values, and whether that behaviour can be replicated through the online channel to drive eCommerce revenue.
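
The Welch statistic used above follows directly from its formula, t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2). As a transparency check, the sketch below reproduces it on simulated age samples (sample sizes and group means are illustrative, not the report's data):

```r
# Welch two-sample t statistic from its formula, checked against t.test(),
# whose default (var.equal = FALSE) is exactly the Welch test used above.
set.seed(1)
instore_ages <- rnorm(500, mean = 43, sd = 14)   # simulated, not the real data
online_ages  <- rnorm(500, mean = 49, sd = 14)

se <- sqrt(var(instore_ages) / length(instore_ages) +
           var(online_ages)  / length(online_ages))
t_manual  <- (mean(instore_ages) - mean(online_ages)) / se
t_builtin <- unname(t.test(instore_ages, online_ages)$statistic)

stopifnot(isTRUE(all.equal(t_manual, t_builtin)))
```

The negative sign simply reflects the argument order (in-store first), just as in the report's t = -51.21.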

3.5 Age Differences Across Regions

Danielle’s second email raises an important question: are there meaningful differences in the age profile of customers across the four regions? Understanding whether age varies by region is valuable for two reasons: it informs region-specific marketing strategies, and it lays the analytical foundation for the predictive modelling section that follows, where age and other demographic variables are used to predict customer behaviour.

Age distributions across the four regions are examined here using a combination of summary statistics, density plots, and boxplots.

Show Code
# --- Summary statistics: Age by region ---
region_age_summary <- df %>%
  group_by(region) %>%
  summarise(
    Transactions = n(),
    Mean_Age     = round(mean(age), 1),
    Median_Age   = round(median(age), 1),
    SD_Age       = round(sd(age), 1),
    Min_Age      = min(age),
    Max_Age      = max(age)
  ) %>%
  arrange(desc(Mean_Age))

# --- Display summary table ---
region_age_summary %>%
  kable(
    col.names = c("Region", "Transactions", "Mean Age",
                  "Median Age", "Std Dev", 
                  "Min Age", "Max Age"),
    caption   = "Table: Customer Age Profile by Region"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff") %>%
  row_spec(1, background = "rgba(0, 212, 255, 0.08)")
Table: Customer Age Profile by Region
Region Transactions Mean Age Median Age Std Dev Min Age Max Age
South 19994 56.6 57 16.5 28 85
East 18000 45.6 45 14.4 18 74
North 15997 43.7 43 14.1 19 74
West 25988 38.8 39 11.9 18 63
Show Code
# --- ANOVA: Is age difference across regions significant? ---
anova_result <- aov(age ~ region, data = df)
anova_summary <- summary(anova_result)
p_value <- anova_summary[[1]][["Pr(>F)"]][1]

cat("\nOne-Way ANOVA: Age by Region\n")

One-Way ANOVA: Age by Region
Show Code
cat("  F-statistic:", round(anova_summary[[1]][["F value"]][1], 2), "\n")
  F-statistic: 6139.42 
Show Code
cat("  p-value:    ", round(p_value, 6), "\n")
  p-value:     0 
Show Code
cat("  Conclusion: ",
    ifelse(p_value < 0.05,
           "Statistically significant age differences across regions (p < 0.05)",
           "No statistically significant difference (p >= 0.05)"), "\n")
  Conclusion:  Statistically significant age differences across regions (p < 0.05) 
Show Code
# --- Region colour palette ---
region_palette <- c(
  "North" = "#00d4ff",
  "South" = "#7b61ff",
  "East"  = "#f0c040",
  "West"  = "#ff6b6b"
)

# --- Density plot: Age distribution by region ---
p1 <- df %>%
  ggplot(aes(x = age, fill = region, color = region)) +
  geom_density(alpha = 0.25, linewidth = 1) +
  scale_fill_manual(values  = region_palette) +
  scale_color_manual(values = region_palette) +
  labs(
    title    = "Age Distribution by Region",
    subtitle = "Distinct age profiles visible across regions",
    x        = "Customer Age",
    y        = "Density",
    fill     = "Region",
    color    = "Region"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background   = element_rect(fill = "#0a0a0f", color = NA),
    panel.background  = element_rect(fill = "#12121a", color = NA),
    panel.grid.major  = element_line(color = "#2a2a3a"),
    panel.grid.minor  = element_blank(),
    plot.title        = element_text(color = "#ffffff",
                                     face  = "bold", size = 12),
    plot.subtitle     = element_text(color = "#8a8a9a", size = 10),
    axis.text         = element_text(color = "#8a8a9a"),
    axis.title        = element_text(color = "#8a8a9a"),
    legend.background = element_rect(fill = "#12121a", color = NA),
    legend.title      = element_text(color = "#8a8a9a", size = 9),
    legend.text       = element_text(color = "#8a8a9a", size = 8)
  )

# --- Boxplot: Age spread by region ---
p2 <- df %>%
  ggplot(aes(x = reorder(region, age, FUN = mean),
             y = age, fill = region)) +
  geom_boxplot(outlier.colour = "#ff4d4d",
               outlier.alpha  = 0.3,
               outlier.size   = 0.8,
               width          = 0.5,
               alpha          = 0.85) +
  geom_jitter(alpha = 0.015, width = 0.2,
              color = "#ffffff", size = 0.4) +
  scale_fill_manual(values = region_palette) +
  coord_flip() +
  labs(
    title    = "Age Spread by Region",
    subtitle = "Ordered by mean age — South notably older",
    x        = NULL,
    y        = "Customer Age"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major = element_line(color = "#2a2a3a"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff",
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#8a8a9a"),
    axis.title       = element_text(color = "#8a8a9a"),
    legend.position  = "none"
  )

# --- Display side by side ---
p1 + p2 +
  plot_annotation(
    title = "Figure: Age Distribution Across Regions",
    theme = theme(
      plot.background = element_rect(fill = "#0a0a0f", color = NA),
      plot.title      = element_text(color = "#ffffff",
                                     face  = "bold", size = 13)
    )
  )

3.5.1 Interpretation

Significant age differences exist across Blackwell’s four regions. The mean age ranking from oldest to youngest is:

  1. South: 56.6 years (oldest)
  2. East: 45.6 years
  3. North: 43.7 years
  4. West: 38.8 years (youngest)

The South region stands out strikingly: its customers are on average 17.8 years older than West region customers. Combined with the earlier finding that the South has the lowest spending ceiling ($499.94 maximum), this paints a consistent picture: the South has an older, more conservatively spending customer base.

The West, by contrast, has the youngest customer profile and the highest transaction values, reinforcing the earlier finding that younger customers tend to spend more.

The One-Way ANOVA confirms these differences are statistically significant (F = 6,139.42, p < 0.0001). The age gaps across regions are not due to chance.

Recommendation: Regional marketing strategies should account for these demographic differences. The South may respond better to trust-building, value-oriented messaging suited to an older audience. The West and East, with younger and higher-spending customers, are better targets for premium and technology-forward campaigns.
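
The F statistic produced by aov() is simply the ratio of between-group to within-group variance. The sketch below reproduces it from first principles on simulated data (group labels, means, and sizes are illustrative, not the report's regions):

```r
# One-way ANOVA F statistic from first principles, checked against aov().
set.seed(7)
grp <- factor(rep(c("A", "B", "C"), each = 50))
y   <- rnorm(150, mean = c(40, 45, 55)[as.integer(grp)], sd = 10)

grand      <- mean(y)
ss_between <- sum(tapply(y, grp, function(g) length(g) * (mean(g) - grand)^2))
ss_within  <- sum(tapply(y, grp, function(g) sum((g - mean(g))^2)))

df_between <- nlevels(grp) - 1           # groups minus one
df_within  <- length(y) - nlevels(grp)   # observations minus groups
F_manual   <- (ss_between / df_between) / (ss_within / df_within)

F_aov <- summary(aov(y ~ grp))[[1]][["F value"]][1]
stopifnot(isTRUE(all.equal(F_manual, F_aov)))
```

A very large F, as seen in the report (6,139.42), means the spread between regional mean ages dwarfs the spread within any single region.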

4 Predictive Modelling

The exploratory analysis has revealed clear patterns in Blackwell’s customer data. This section builds on those findings by developing predictive models that can anticipate customer behaviour, moving beyond description into prediction.

Three prediction tasks are addressed:

  1. Purchase Channel Prediction: Can a customer’s likelihood of shopping online or in-store be predicted from their demographic and transaction data? (Binary Classification)

  2. Customer Age Prediction: Can the age of a customer be estimated from their purchasing behaviour and region? (Regression)

  3. Region Prediction: Can the region a customer belongs to be predicted from their demographic profile? (Multi-Class Classification)

For each task, three models are trained and evaluated on a held-out test set, and the best performing model is selected for final interpretation and business recommendations.

4.1 Modelling Approach

All three tasks follow a consistent workflow:

  • Data splitting: 80% training, 20% test set
  • Reproducibility: a fixed random seed (set.seed(1234)) is used throughout to ensure results are fully reproducible
  • Model training: three candidate models trained per task
  • Model selection: best model chosen based on evaluation metrics on the test set
  • Interpretation: findings translated into actionable business recommendations
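
caret's createDataPartition(), used later in this report, additionally stratifies on the outcome so the Online/In-Store ratio is preserved in both halves. Ignoring stratification, the core of the split is just a seeded random 80/20 partition, sketched here in base R on a hypothetical data frame (df_demo is a stand-in, not the report's df):

```r
# Seeded 80/20 split: the fixed seed makes the same rows land in the
# training set on every run (df_demo is a hypothetical stand-in for df).
set.seed(1234)
df_demo <- data.frame(id = 1:1000, amount = runif(1000, 5, 3000))

train_idx <- sample(nrow(df_demo), size = floor(0.8 * nrow(df_demo)))
train_set <- df_demo[train_idx, ]
test_set  <- df_demo[-train_idx, ]

stopifnot(nrow(train_set) == 800, nrow(test_set) == 200)
stopifnot(length(intersect(train_set$id, test_set$id)) == 0)
```

The final assertion confirms what matters most: no row appears in both sets, so test performance is measured on genuinely unseen data.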

4.2 Features Available for Modelling

Task Target Variable Predictor Variables
Channel Prediction channel age, items, amount, region
Age Prediction age channel, items, amount, region
Region Prediction region age, channel, items, amount
Show Code
# --- Set seed for reproducibility ---
set.seed(1234)

# --- Define reusable train control ---
# 5-fold cross validation during training
train_control <- trainControl(
  method          = "cv",
  number          = 5,
  savePredictions = "final",
  verboseIter     = FALSE
)

cat("Modelling setup complete.\n")
Modelling setup complete.
Show Code
cat("Random seed set to 1234 for full reproducibility.\n")
Random seed set to 1234 for full reproducibility.
Show Code
cat("5-fold cross-validation configured for all models.\n")
5-fold cross-validation configured for all models.

A 5-fold cross-validation strategy is applied during training. This means the training set is divided into five equal folds. The model trains on four folds and validates on the fifth, rotating until every fold has served as the validation set. This produces a more reliable estimate of model performance than a single train/validate split, and reduces the risk of overfitting.
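
The fold rotation described above can be sketched in base R (caret handles this internally via trainControl; the model-fitting step is elided here, and the fold count and sample size are illustrative):

```r
# 5-fold rotation: each row receives a fold label, and every fold serves
# as the validation set exactly once while the other four folds train.
set.seed(1234)
n     <- 100
folds <- sample(rep(1:5, length.out = n))   # shuffled fold label per row

for (k in 1:5) {
  val_idx   <- which(folds == k)   # validation rows for this iteration
  train_idx <- which(folds != k)   # the remaining four folds
  # fit the model on train_idx and score it on val_idx here
  stopifnot(length(val_idx) + length(train_idx) == n)
}

stopifnot(all(table(folds) == n / 5))   # equal-sized folds
```

Averaging the five validation scores gives the cross-validated estimate reported by caret.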

5 Purchase Channel Prediction

The first modelling task investigates whether a customer’s purchase channel, online or in-store, can be predicted from their demographic and transaction data. This is a binary classification problem, where the target variable has two possible outcomes: Online or In-Store.

Accurately predicting purchase channel would allow Blackwell to proactively identify customers likely to shop online versus in-store, enabling more targeted marketing, personalised communications, and smarter inventory decisions across channels.

Candidate predictor variables: age, items, amount, region

5.1 Data Splitting

Show Code
# --- 80/20 Train/Test Split ---
train_index <- createDataPartition(df$channel, p = 0.8, list = FALSE)

train_channel <- df[train_index, ]
test_channel  <- df[-train_index, ]

cat("Channel Prediction: Data Split Summary:\n")
Channel Prediction: Data Split Summary:
Show Code
cat("  Training set:", nrow(train_channel), "rows\n")
  Training set: 63984 rows
Show Code
cat("  Test set:    ", nrow(test_channel), "rows\n\n")
  Test set:     15995 rows
Show Code
# --- Verify class balance in split ---
cat("Training set channel distribution:\n")
Training set channel distribution:
Show Code
print(prop.table(table(train_channel$channel)))

  Online In-Store 
     0.5      0.5 
Show Code
cat("\nTest set channel distribution:\n")

Test set channel distribution:
Show Code
print(prop.table(table(test_channel$channel)))

   Online  In-Store 
0.4999687 0.5000313 

5.2 Feature Selection

Show Code
# --- Recursive Feature Elimination ---
# Using Random Forest as the estimator for feature importance
rfe_control <- rfeControl(
  functions = rfFuncs,
  method    = "cv",
  number    = 5,
  verbose   = FALSE
)

# --- Prepare features and target ---
# Convert region to a numeric code for RFE compatibility
# (the channel target stays a factor; no numeric copy is needed)
train_channel_rfe <- train_channel %>%
  mutate(region_num = as.numeric(region))

rfe_features <- train_channel_rfe %>%
  select(age, items, amount, region_num)

rfe_target <- train_channel_rfe$channel

# --- Run RFE ---
set.seed(1234)
rfe_result_channel <- rfe(
  x          = rfe_features,
  y          = rfe_target,
  sizes      = c(1, 2, 3, 4),
  rfeControl = rfe_control
)

# --- Results ---
cat("Optimal Features Selected:\n")
Optimal Features Selected:
Show Code
print(predictors(rfe_result_channel))
[1] "region_num" "amount"     "age"       
Show Code
cat("\nRFE Performance Summary:\n")

RFE Performance Summary:
Show Code
print(rfe_result_channel)

Recursive feature selection

Outer resampling method: Cross-Validated (5 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD  KappaSD Selected
         1   0.7485 0.4970   0.002571 0.005143         
         2   0.8642 0.7284   0.002162 0.004328         
         3   0.8864 0.7728   0.003316 0.006636        *
         4   0.8811 0.7623   0.003380 0.006762         

The top 3 variables (out of 3):
   region_num, amount, age
Show Code
# --- Extract RFE results for custom plot ---
rfe_results_df <- rfe_result_channel$results

# --- Custom readable RFE plot ---
ggplot(rfe_results_df, aes(x = Variables, y = Accuracy)) +
  geom_line(color = "#00d4ff", linewidth = 1.2) +
  geom_point(aes(size = Accuracy), 
             color = "#f0c040", fill = "#f0c040",
             shape = 21, stroke = 1.5) +
  geom_text(aes(label = round(Accuracy, 4)),
            vjust  = -1.2,
            color  = "#ffffff",
            size   = 3.8,
            fontface = "bold") +
  scale_x_continuous(breaks = c(1, 2, 3, 4)) +
  scale_y_continuous(
    limits = c(
      min(rfe_results_df$Accuracy) - 0.01,
      max(rfe_results_df$Accuracy) + 0.02
    ),
    labels = scales::percent_format(accuracy = 0.1)
  ) +
  labs(
    title    = "Figure: RFE Feature Selection for Channel Prediction",
    subtitle = "Cross-validated accuracy by number of features retained",
    x        = "Number of Features",
    y        = "Accuracy (Cross-Validated)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#0a0a0f", color = NA),
    panel.background = element_rect(fill = "#12121a", color = NA),
    panel.grid.major = element_line(color = "#2a2a3a", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    plot.title       = element_text(color = "#ffffff",
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#e8e8e8", size = 11),
    axis.title       = element_text(color = "#e8e8e8", size = 11),
    legend.position  = "none"
  )

5.3 Feature Selection Results

Recursive Feature Elimination identified three optimal features for predicting purchase channel: region, amount, and age.

The items variable was eliminated as it contributed no meaningful predictive power, consistent with the EDA finding that the number of items purchased bears no relationship to any other aspect of customer behaviour.

Cross-validated accuracy improves progressively from 74.85% with one feature to 88.64% with three features. Adding the fourth feature (items) actually reduces accuracy slightly to 88.11%, confirming the three-feature model as optimal.

All three models will be trained using region, amount, and age as predictor variables.

5.4 Model Training

Show Code
# --- Prepare train and test sets with RFE selected features ---
train_x <- train_channel %>% select(age, amount, region)
test_x  <- test_channel %>% select(age, amount, region)

train_y <- train_channel$channel
test_y  <- test_channel$channel

cat("Training models with features: age, amount, region\n\n")
Training models with features: age, amount, region
Show Code
# --- Model 1: Logistic Regression ---
set.seed(1234)
model_lr_channel <- train(
  x          = train_x,
  y          = train_y,
  method     = "glm",
  family     = "binomial",
  trControl  = train_control,
  preProcess = c("center", "scale")
)
cat("Logistic Regression — Trained\n")
Logistic Regression — Trained
Show Code
# --- Model 2: Random Forest ---
set.seed(1234)
model_rf_channel <- train(
  x          = train_x,
  y          = train_y,
  method     = "rf",
  trControl  = train_control,
  preProcess = c("center", "scale"),
  tuneLength = 3
)
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
Show Code
cat("Random Forest — Trained\n")
Random Forest — Trained
Show Code
# --- Model 3: Decision Tree ---
set.seed(1234)
model_dt_channel <- train(
  x          = train_x,
  y          = train_y,
  method     = "rpart",
  trControl  = train_control,
  preProcess = c("center", "scale"),
  tuneLength = 10
)
cat("Decision Tree — Trained\n")
Decision Tree — Trained
Show Code
cat("\nAll three models trained successfully.\n")

All three models trained successfully.

5.5 Channel Model Comparison

Show Code
# --- Predictions on test set ---
pred_lr_channel <- predict(model_lr_channel, test_x)
pred_rf_channel <- predict(model_rf_channel, test_x)
pred_dt_channel <- predict(model_dt_channel, test_x)

# --- Confusion matrices ---
cm_lr <- confusionMatrix(pred_lr_channel, test_y)
cm_rf <- confusionMatrix(pred_rf_channel, test_y)
cm_dt <- confusionMatrix(pred_dt_channel, test_y)

# --- Model comparison table ---
model_comparison_channel <- tibble(
  Model       = c("Logistic Regression",
                  "Random Forest",
                  "Decision Tree"),
  Accuracy    = c(cm_lr$overall["Accuracy"],
                  cm_rf$overall["Accuracy"],
                  cm_dt$overall["Accuracy"]),
  Kappa       = c(cm_lr$overall["Kappa"],
                  cm_rf$overall["Kappa"],
                  cm_dt$overall["Kappa"]),
  Sensitivity = c(cm_lr$byClass["Sensitivity"],
                  cm_rf$byClass["Sensitivity"],
                  cm_dt$byClass["Sensitivity"]),
  Specificity = c(cm_lr$byClass["Specificity"],
                  cm_rf$byClass["Specificity"],
                  cm_dt$byClass["Specificity"])
) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  arrange(desc(Accuracy))

# --- Display comparison table ---
model_comparison_channel %>%
  kable(
    col.names = c("Model", "Accuracy", "Kappa",
                  "Sensitivity", "Specificity"),
    caption   = "Table 9: Model Comparison — Channel Prediction"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff") %>%
  row_spec(1, background = "rgba(0, 212, 255, 0.08)")
Table 9: Model Comparison — Channel Prediction
Model Accuracy Kappa Sensitivity Specificity
Decision Tree 0.8915 0.7829 0.7843 0.9986
Random Forest 0.8786 0.7572 0.8074 0.9497
Logistic Regression 0.8392 0.6784 0.8057 0.8727
Show Code
# --- Best model statement ---
best_model    <- model_comparison_channel$Model[1]
best_accuracy <- round(model_comparison_channel$Accuracy[1] * 100, 2)
best_kappa    <- model_comparison_channel$Kappa[1]


cat("Best Performing Model:", best_model, "\n")
Best Performing Model: Decision Tree 
Show Code
cat("Test Set Accuracy:    ", best_accuracy, "%\n")
Test Set Accuracy:     89.15 %
Show Code
cat("Kappa Statistic:      ", best_kappa, "\n")
Kappa Statistic:       0.7829 

5.6 Purchase Channel Best Model

Show Code
# --- Confusion Matrix: Decision Tree ---
cat("Confusion Matrix — Decision Tree\n")
Confusion Matrix — Decision Tree
Show Code
cat("==================================\n")
==================================
Show Code
print(cm_dt$table)
          Reference
Prediction Online In-Store
  Online     6272       11
  In-Store   1725     7987
Show Code
# --- Actual vs Predicted comparison table ---
dt_results <- tibble(
  Actual    = test_y,
  Predicted = pred_dt_channel,
  Correct   = test_y == pred_dt_channel
)

# --- Summary of correct vs incorrect predictions ---
dt_summary <- dt_results %>%
  group_by(Actual, Predicted) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(
    Result = ifelse(Actual == Predicted, 
                    "Correct", "Incorrect")
  )

# --- Display actual vs predicted table ---
dt_summary %>%
  kable(
    col.names = c("Actual", "Predicted", 
                  "Count", "Result"),
    caption   = "Table 10: Actual vs Predicted — Decision Tree"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff") %>%
  column_spec(4, bold = TRUE,
              color = ifelse(dt_summary$Result == "Correct",
                             "#00ff9d", "#ff6b6b"))
Table 10: Actual vs Predicted — Decision Tree
Actual Predicted Count Result
Online Online 6272 Correct
Online In-Store 1725 Incorrect
In-Store Online 11 Incorrect
In-Store In-Store 7987 Correct
Show Code
# --- Overall prediction accuracy breakdown ---
cat("\nPrediction Breakdown:\n")

Prediction Breakdown:
Show Code
cat("  Correctly classified:  ", sum(dt_results$Correct), "\n")
  Correctly classified:   14259 
Show Code
cat("  Incorrectly classified:", sum(!dt_results$Correct), "\n")
  Incorrectly classified: 1736 
Show Code
cat("  Overall Accuracy:      ", 
    round(mean(dt_results$Correct) * 100, 2), "%\n")
  Overall Accuracy:       89.15 %

5.7 Save The Best Channel Model

Show Code
# --- Save best model for channel prediction ---
saveRDS(model_dt_channel, "best_model_channel.rds")

cat("Best model for channel prediction saved successfully.\n")
Best model for channel prediction saved successfully.
Show Code
cat("Model: Decision Tree\n")
Model: Decision Tree
Show Code
cat("File:  best_model_channel.rds\n")
File:  best_model_channel.rds

5.8 Interpretation: Channel Prediction

The Decision Tree outperformed both Random Forest and Logistic Regression, achieving a test set accuracy of 89.15% and a Kappa of 0.7829, indicating strong agreement between predicted and actual channel classifications beyond chance.

The actual vs predicted table reveals an important asymmetry in model performance:

  • In-Store customers are predicted with near-perfect accuracy. Only 11 of 7,998 In-Store customers were misclassified as Online
  • Online customers are harder to predict. 1,725 of 7,997 Online customers were incorrectly classified as In-Store

This suggests the model has learned a very strong signal for identifying In-Store customers, likely driven by age and region, but finds Online customers more difficult to distinguish, possibly because their demographic profile overlaps with that of In-Store customers in certain regions.

Overall, 14,259 out of 15,995 customers were correctly classified, with only 1,736 misclassifications.
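
These figures can be sanity-checked directly from the confusion matrix counts above. A minimal base-R recomputation, with Online treated as the positive class (caret defaults to the first factor level):

```r
# Recompute headline metrics from the printed Decision Tree
# confusion matrix (Online = positive class)
tp <- 6272  # actual Online, predicted Online
fn <- 1725  # actual Online, predicted In-Store
fp <- 11    # actual In-Store, predicted Online
tn <- 7987  # actual In-Store, predicted In-Store

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # share of Online customers caught
specificity <- tn / (tn + fp)  # share of In-Store customers caught

round(c(accuracy, sensitivity, specificity), 4)
# 0.8915 0.7843 0.9986 (matches Table 9)
```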

Recommendation: The Decision Tree model provides a reliable tool for predicting whether a customer will shop online or in-store. Blackwell can use this model to proactively segment new customers and personalise marketing communications by channel before a purchase is even made. A customer the model labels as Online is almost always genuinely Online (just 11 false positives on the test set), which is particularly valuable for targeted digital marketing campaigns.

6 Customer Age Prediction

The second modelling task investigates whether the age of a customer can be predicted from their purchasing behaviour and demographic data. This is a regression problem; the target variable, age, is a continuous numeric value rather than a category.

Predicting customer age is valuable for Blackwell because age data may not always be available for every customer. A reliable predictive model would allow the business to estimate the age profile of new customers from transaction data alone, enabling more targeted marketing without requiring customers to explicitly provide their age.

It is worth acknowledging upfront that the EDA revealed only a weak negative correlation between age and amount spent (r = -0.282), and that age varies significantly by region. Predicting age from the available variables is therefore expected to be a challenging task, and this is reflected honestly in the results.
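
As a rough sense of scale for that correlation: in a simple linear fit, the squared correlation bounds the share of age variance a single predictor can explain. A one-line check, using the r value from the EDA:

```r
# Share of age variance explainable by amount alone in a linear fit
r <- -0.282          # EDA correlation between age and amount
round(r^2 * 100, 1)  # roughly 8% of the variance
```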

Candidate predictor variables: channel, items, amount, region

6.1 Data Splitting

Show Code
# --- 80/20 Train/Test Split ---
set.seed(1234)
train_index_age <- createDataPartition(df$age, p = 0.8, list = FALSE)

train_age <- df[train_index_age, ]
test_age  <- df[-train_index_age, ]

cat("Age Prediction — Data Split Summary:\n")
Age Prediction — Data Split Summary:
Show Code
cat("  Training set:", nrow(train_age), "rows\n")
  Training set: 63986 rows
Show Code
cat("  Test set:    ", nrow(test_age), "rows\n\n")
  Test set:     15993 rows
Show Code
# --- Verify age distribution in split ---
cat("Training set age summary:\n")
Training set age summary:
Show Code
print(summary(train_age$age))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   33.00   45.00   45.77   56.00   85.00 
Show Code
cat("\nTest set age summary:\n")

Test set age summary:
Show Code
print(summary(test_age$age))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   33.00   45.00   45.73   56.00   85.00 

6.2 Feature Selection For Age Prediction

Show Code
# --- Recursive Feature Elimination for Age Prediction ---
# Using Random Forest as the estimator
rfe_control_age <- rfeControl(
  functions = rfFuncs,
  method    = "cv",
  number    = 5,
  verbose   = FALSE
)

# --- Prepare features and target ---
# Convert factors to numeric for RFE compatibility
train_age_rfe <- train_age %>%
  mutate(
    channel_num = as.numeric(channel),
    region_num  = as.numeric(region)
  )

rfe_features_age <- train_age_rfe %>%
  select(channel_num, items, amount, region_num)

rfe_target_age <- train_age_rfe$age

# --- Run RFE ---
set.seed(1234)
rfe_result_age <- rfe(
  x          = rfe_features_age,
  y          = rfe_target_age,
  sizes      = c(1, 2, 3, 4),
  rfeControl = rfe_control_age
)

# --- Results ---
cat("Optimal Features Selected:\n")
Optimal Features Selected:
Show Code
print(predictors(rfe_result_age))
[1] "channel_num" "region_num"  "amount"     
Show Code
cat("\nRFE Performance Summary:\n")

RFE Performance Summary:
Show Code
print(rfe_result_age)

Recursive feature selection

Outer resampling method: Cross-Validated (5 fold) 

Resampling performance over subset size:

 Variables  RMSE Rsquared   MAE  RMSESD RsquaredSD   MAESD Selected
         1 15.47  0.03125 12.81 0.05803  0.0009713 0.04078         
         2 14.20  0.19708 12.03 0.06319  0.0048224 0.05170         
         3 14.00  0.21197 11.90 0.05871  0.0034912 0.05244        *
         4 14.02  0.21140 11.92 0.04968  0.0030715 0.04578         

The top 3 variables (out of 3):
   channel_num, region_num, amount
Show Code
# --- RFE plot ---

ggplot(rfe_result_age) +
  labs(
    title    = "Figure: RFE Feature Selection for Age Prediction",
    subtitle = "RMSE by number of features selected",
    x        = "Number of Features",
    y        = "RMSE (Cross-Validated)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#12121a", color = NA),
    panel.background = element_rect(fill = "#1e1e2e", color = "#2a2a3a"),
    panel.grid.major = element_line(color = "#3a3a4a"),
    panel.grid.minor = element_line(color = "#2a2a3a"),
    plot.title       = element_text(color = "#ffffff",
                                    face  = "bold", size = 12),
    plot.subtitle    = element_text(color = "#8a8a9a", size = 10),
    axis.text        = element_text(color = "#e8e8e8", size = 10),
    axis.title       = element_text(color = "#e8e8e8", size = 11),
    axis.line        = element_line(color = "#8a8a9a"),
    legend.background = element_rect(fill = "#1e1e2e", color = NA),
    legend.text       = element_text(color = "#e8e8e8"),
    legend.title      = element_text(color = "#e8e8e8")
  )

6.3 Feature Selection Results

Recursive Feature Elimination identified three optimal features for predicting customer age: channel, region, and amount.

The items variable was eliminated, contributing no meaningful predictive power, consistent with findings across both the EDA and the channel prediction task.

Performance improves progressively from an RMSE of 15.47 with one feature to 14.00 with three features, with R² improving from 0.031 to 0.212. Adding the fourth feature (items) produces no improvement, confirming the three-feature model as optimal.

However, an R² of 0.212 indicates that the three selected features explain only 21.2% of the variance in customer age, confirming the upfront acknowledgement that age is a difficult variable to predict from the available data. This is an honest and important finding that will be reflected in the model results.

All three models will be trained using channel, amount, and region as predictor variables.

6.4 Model Training For Age Prediction

Show Code
# --- Prepare train and test sets with RFE selected features ---
train_x_age <- train_age %>% select(channel, amount, region)
test_x_age  <- test_age %>% select(channel, amount, region)

train_y_age <- train_age$age
test_y_age  <- test_age$age

cat("Training models with features: channel, amount, region\n\n")
Training models with features: channel, amount, region
Show Code
# --- Model 1: Linear Regression ---
set.seed(1234)
model_lr_age <- train(
  x          = train_x_age,
  y          = train_y_age,
  method     = "lm",
  trControl  = train_control,
  preProcess = c("center", "scale")
)
cat("Linear Regression — Trained\n")
Linear Regression — Trained
Show Code
# --- Model 2: Random Forest ---
set.seed(1234)
model_rf_age <- train(
  x          = train_x_age,
  y          = train_y_age,
  method     = "rf",
  trControl  = train_control,
  preProcess = c("center", "scale"),
  tuneLength = 3
)
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
Show Code
cat("Random Forest — Trained\n")
Random Forest — Trained
Show Code
# --- Model 3: Decision Tree ---
set.seed(1234)
model_dt_age <- train(
  x          = train_x_age,
  y          = train_y_age,
  method     = "rpart",
  trControl  = train_control,
  preProcess = c("center", "scale"),
  tuneLength = 10
)
cat("Decision Tree — Trained\n")
Decision Tree — Trained
Show Code
cat("\nAll three models trained successfully.\n")

All three models trained successfully.

6.5 Age Comparison Table And Best Model

Show Code
# --- Predictions on test set ---
pred_lr_age <- predict(model_lr_age, test_x_age)
pred_rf_age <- predict(model_rf_age, test_x_age)
pred_dt_age <- predict(model_dt_age, test_x_age)

# --- Model comparison table ---
model_comparison_age <- tibble(
  Model = c("Linear Regression",
            "Random Forest",
            "Decision Tree"),
  RMSE  = c(
    round(RMSE(pred_lr_age, test_y_age), 4),
    round(RMSE(pred_rf_age, test_y_age), 4),
    round(RMSE(pred_dt_age, test_y_age), 4)
  ),
  MAE = c(
    round(MAE(pred_lr_age, test_y_age), 4),
    round(MAE(pred_rf_age, test_y_age), 4),
    round(MAE(pred_dt_age, test_y_age), 4)
  ),
  R_Squared = c(
    round(R2(pred_lr_age, test_y_age), 4),
    round(R2(pred_rf_age, test_y_age), 4),
    round(R2(pred_dt_age, test_y_age), 4)
  )
) %>%
  arrange(RMSE)

# --- Display comparison table ---
model_comparison_age %>%
  kable(
    col.names = c("Model", "RMSE", "MAE", "R-Squared"),
    caption   = "Table: Model Comparison — Age Prediction"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff") %>%
  row_spec(1, background = "rgba(0, 212, 255, 0.08)")
Table: Model Comparison — Age Prediction
Model RMSE MAE R-Squared
Decision Tree 13.8861 11.8376 0.2182
Random Forest 13.8890 11.8419 0.2179
Linear Regression 14.0959 11.9719 0.1944
Show Code
# --- Best model statement ---
best_model_age     <- model_comparison_age$Model[1]
best_rmse_age      <- model_comparison_age$RMSE[1]
best_mae_age       <- model_comparison_age$MAE[1]
best_rsquared_age  <- model_comparison_age$R_Squared[1]

cat("Best Performing Model:", best_model_age, "\n")
Best Performing Model: Decision Tree 
Show Code
cat("RMSE:                 ", best_rmse_age, "\n")
RMSE:                  13.8861 
Show Code
cat("MAE:                  ", best_mae_age, "\n")
MAE:                   11.8376 
Show Code
cat("R-Squared:            ", best_rsquared_age, "\n")
R-Squared:             0.2182 

6.6 Actual vs Predicted Table For The Decision Tree

Show Code
# --- Actual vs Predicted comparison ---
age_results <- tibble(
  Actual    = test_y_age,
  Predicted = round(pred_dt_age, 1),
  Error     = round(abs(test_y_age - pred_dt_age), 1)
) %>%
  mutate(
    Error_Band = case_when(
      Error <= 5  ~ "Within 5 years",
      Error <= 10 ~ "Within 10 years",
      Error <= 15 ~ "Within 15 years",
      TRUE        ~ "More than 15 years"
    )
  )

# --- Error band summary ---
error_summary <- age_results %>%
  group_by(Error_Band) %>%
  summarise(
    Count      = n(),
    Percentage = round(Count / nrow(age_results) * 100, 2)
  ) %>%
  arrange(desc(Percentage))

# --- Display error band table ---
error_summary %>%
  kable(
    col.names = c("Prediction Accuracy", "Count", "Percentage (%)"),
    caption   = "Table: Actual vs Predicted Age — Error Band Summary"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff")
Table: Actual vs Predicted Age — Error Band Summary
Prediction Accuracy Count Percentage (%)
More than 15 years 5506 34.43
Within 5 years 3567 22.30
Within 10 years 3515 21.98
Within 15 years 3405 21.29
Show Code
# --- Overall prediction summary ---
cat("\nAge Prediction Summary: Decision Tree:\n")

Age Prediction Summary: Decision Tree:
Show Code
cat("  Mean Absolute Error:", round(mean(age_results$Error), 2), "years\n")
  Mean Absolute Error: 11.84 years
Show Code
cat("  Median Error:       ", round(median(age_results$Error), 2), "years\n")
  Median Error:        11.5 years
Show Code
cat("  Min Error:          ", round(min(age_results$Error), 2), "years\n")
  Min Error:           0 years
Show Code
cat("  Max Error:          ", round(max(age_results$Error), 2), "years\n")
  Max Error:           28.6 years

6.7 Save The Best Age Model

Show Code
# --- Save best model for age prediction ---
saveRDS(model_dt_age, "best_model_age.rds")

cat("Best model for age prediction saved successfully.\n")
Best model for age prediction saved successfully.
Show Code
cat("Model: Decision Tree\n")
Model: Decision Tree
Show Code
cat("File:  best_model_age.rds\n")
File:  best_model_age.rds

6.8 Interpretation: Age Prediction

The Decision Tree achieved the lowest RMSE of 13.8861 years and an R-Squared of 0.2182, marginally outperforming Random Forest (RMSE = 13.8890, R² = 0.2179) and comfortably ahead of Linear Regression (RMSE = 14.0959, R² = 0.1944).

However, the error band analysis reveals the honest limitations of this model:

  • Only 22.30% of predictions fall within 5 years of the actual age
  • 34.43% of predictions are off by more than 15 years, the largest single category
  • The mean absolute error is 11.84 years, meaning on average the model misjudges a customer’s age by nearly 12 years

This confirms the upfront acknowledgement that age is a challenging variable to predict from the available data. With an R-Squared of just 0.2182, channel, amount, and region together explain only 21.82% of the variance in customer age; the remaining 78.18% is driven by factors not captured in this dataset.
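
To put the RMSE in context, a naive baseline that always predicts the mean age has an RMSE equal to the standard deviation of the test-set ages. Treating the reported R² as 1 - SSE/SST gives a rough implied baseline; this is an approximation, since caret's R2() is a squared correlation rather than a variance-explained ratio:

```r
# Rough baseline comparison for the age model (approximation:
# treats the reported test R² as 1 - SSE/SST)
rmse_model <- 13.8861
r_squared  <- 0.2182

rmse_baseline <- rmse_model / sqrt(1 - r_squared)  # implied SD of test ages
round(rmse_baseline, 1)                            # about 15.7 years
round((1 - rmse_model / rmse_baseline) * 100, 1)   # about 11.6% better than the mean
```

In other words, the model shaves only around 1.8 years of RMSE off a predictor that ignores the inputs entirely, consistent with the weak error-band results above.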

Recommendation: This model is not reliable enough for operational use in its current form. Blackwell would need richer customer data, such as browsing behaviour, product categories purchased, or loyalty programme data, to build a meaningfully accurate age prediction model. The finding itself is valuable, however: it confirms that transaction data alone is insufficient to infer customer age.

7 Region Prediction

The third and final modelling task investigates whether the region a customer belongs to can be predicted from their demographic and transaction data. This is a multi-class classification problem; the target variable has four possible outcomes: North, South, East, and West.

Successfully predicting customer region would allow Blackwell to infer the geographic location of new or anonymous customers from their transaction behaviour alone, enabling region-specific marketing without requiring customers to explicitly provide their location.
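
Because the four classes are imbalanced, the accuracy floor for this task is the majority-class baseline rather than 25%. Using the region proportions reported in the data split below (rounded):

```r
# Majority-class baseline: always predicting the most common region
# (proportions rounded from the Section 7.1 training split)
props <- c(North = 0.200, South = 0.250, East = 0.225, West = 0.325)
max(props)  # 0.325: always guessing "West" is already 32.5% accurate
```

Any model worth deploying therefore needs to clear roughly 32.5% accuracy, not 25%.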

Candidate predictor variables: age, channel, items, amount

7.1 Data Splitting

Show Code
# --- 80/20 Train/Test Split ---
set.seed(1234)
train_index_region <- createDataPartition(df$region, p = 0.8, list = FALSE)

train_region <- df[train_index_region, ]
test_region  <- df[-train_index_region, ]

cat("Region Prediction — Data Split Summary:\n")
Region Prediction — Data Split Summary:
Show Code
cat("  Training set:", nrow(train_region), "rows\n")
  Training set: 63985 rows
Show Code
cat("  Test set:    ", nrow(test_region), "rows\n\n")
  Test set:     15994 rows
Show Code
# --- Verify class balance in split ---
cat("Training set region distribution:\n")
Training set region distribution:
Show Code
print(prop.table(table(train_region$region)))

    North     South      East      West 
0.2000156 0.2499961 0.2250527 0.3249355 
Show Code
cat("\nTest set region distribution:\n")

Test set region distribution:
Show Code
print(prop.table(table(test_region$region)))

    North     South      East      West 
0.2000125 0.2499687 0.2250844 0.3249344 

7.2 Feature Selection

Show Code
# --- Recursive Feature Elimination for Region Prediction ---
rfe_control_region <- rfeControl(
  functions = rfFuncs,
  method    = "cv",
  number    = 5,
  verbose   = FALSE
)

# --- Prepare features and target ---
# Convert factors to numeric for RFE compatibility
train_region_rfe <- train_region %>%
  mutate(channel_num = as.numeric(channel))

rfe_features_region <- train_region_rfe %>%
  select(age, channel_num, items, amount)

rfe_target_region <- train_region_rfe$region

# --- Run RFE ---
set.seed(1234)
rfe_result_region <- rfe(
  x          = rfe_features_region,
  y          = rfe_target_region,
  sizes      = c(1, 2, 3, 4),
  rfeControl = rfe_control_region
)

# --- Results ---
cat("Optimal Features Selected:\n")
Optimal Features Selected:
Show Code
print(predictors(rfe_result_region))
[1] "channel_num" "amount"      "age"        
Show Code
cat("\nRFE Performance Summary:\n")

RFE Performance Summary:
Show Code
print(rfe_result_region)

Recursive feature selection

Outer resampling method: Cross-Validated (5 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD   KappaSD Selected
         1   0.4500 0.2904  2.158e-05 0.0002476         
         2   0.6263 0.4970  1.235e-03 0.0018071         
         3   0.6427 0.5166  3.726e-03 0.0050547        *
         4   0.6315 0.5009  3.588e-03 0.0049032         

The top 3 variables (out of 3):
   channel_num, amount, age
Show Code
# --- RFE plot ---
ggplot(rfe_result_region) +
  labs(
    title    = "Figure: RFE — Feature Selection for Region Prediction",
    subtitle = "Accuracy by number of features selected",
    x        = "Number of Features",
    y        = "Accuracy (Cross-Validated)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.background   = element_rect(fill = "#12121a", color = NA),
    panel.background  = element_rect(fill = "#1e1e2e", color = "#2a2a3a"),
    panel.grid.major  = element_line(color = "#3a3a4a"),
    panel.grid.minor  = element_line(color = "#2a2a3a"),
    plot.title        = element_text(color = "#ffffff",
                                     face  = "bold", size = 12),
    plot.subtitle     = element_text(color = "#8a8a9a", size = 10),
    axis.text         = element_text(color = "#e8e8e8", size = 10),
    axis.title        = element_text(color = "#e8e8e8", size = 11),
    axis.line         = element_line(color = "#8a8a9a"),
    legend.background = element_rect(fill = "#1e1e2e", color = NA),
    legend.text       = element_text(color = "#e8e8e8"),
    legend.title      = element_text(color = "#e8e8e8")
  )

7.3 Feature Selection Results

Recursive Feature Elimination identified three optimal features for predicting customer region: channel, amount, and age.

Notably, items has now been eliminated across all three modelling tasks, consistently contributing no meaningful predictive power throughout this entire analysis.

Accuracy improves from 0.45 with one feature to 0.6427 with three features. Adding the fourth feature (items) actually reduces accuracy to 0.6315, confirming the three-feature model as optimal.

All three models will be trained using channel, amount, and age as predictor variables.

7.4 Region Model Training

Show Code
# --- Prepare train and test sets with RFE selected features ---
train_x_region <- train_region %>% select(age, amount, channel)
test_x_region  <- test_region %>% select(age, amount, channel)

train_y_region <- train_region$region
test_y_region  <- test_region$region

cat("Training models with features: age, amount, channel\n\n")
Training models with features: age, amount, channel
Show Code
# --- Model 1: Multinomial Logistic Regression ---
set.seed(1234)
model_lr_region <- train(
  x          = train_x_region,
  y          = train_y_region,
  method     = "multinom",
  trControl  = train_control,
  preProcess = c("center", "scale"),
  trace      = FALSE
)
cat("Multinomial Logistic Regression — Trained\n")
Multinomial Logistic Regression — Trained
Show Code
# --- Model 2: Random Forest ---
set.seed(1234)
model_rf_region <- train(
  x          = train_x_region,
  y          = train_y_region,
  method     = "rf",
  trControl  = train_control,
  preProcess = c("center", "scale"),
  tuneLength = 3
)
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
Show Code
cat("Random Forest — Trained\n")
Random Forest — Trained
Show Code
# --- Model 3: Decision Tree ---
set.seed(1234)
model_dt_region <- train(
  x          = train_x_region,
  y          = train_y_region,
  method     = "rpart",
  trControl  = train_control,
  preProcess = c("center", "scale"),
  tuneLength = 10
)
cat("Decision Tree — Trained\n")
Decision Tree — Trained
Show Code
cat("\nAll three models trained successfully.\n")

All three models trained successfully.

7.5 Region Comparison Table And Best Model

Show Code
# --- Predictions on test set ---
pred_lr_region <- predict(model_lr_region, test_x_region)
pred_rf_region <- predict(model_rf_region, test_x_region)
pred_dt_region <- predict(model_dt_region, test_x_region)

# --- Confusion matrices ---
cm_lr_region <- confusionMatrix(pred_lr_region, test_y_region)
cm_rf_region <- confusionMatrix(pred_rf_region, test_y_region)
cm_dt_region <- confusionMatrix(pred_dt_region, test_y_region)

# --- Model comparison table ---
model_comparison_region <- tibble(
  Model    = c("Multinomial Logistic Regression",
               "Random Forest",
               "Decision Tree"),
  Accuracy = c(cm_lr_region$overall["Accuracy"],
               cm_rf_region$overall["Accuracy"],
               cm_dt_region$overall["Accuracy"]),
  Kappa    = c(cm_lr_region$overall["Kappa"],
               cm_rf_region$overall["Kappa"],
               cm_dt_region$overall["Kappa"])
) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  arrange(desc(Accuracy))

# --- Display comparison table ---
model_comparison_region %>%
  kable(
    col.names = c("Model", "Accuracy", "Kappa"),
    caption   = "Table: Model Comparison for Region Prediction"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff") %>%
  row_spec(1, background = "rgba(0, 212, 255, 0.08)")
Table: Model Comparison for Region Prediction
Model Accuracy Kappa
Decision Tree 0.6441 0.5175
Random Forest 0.6233 0.4892
Multinomial Logistic Regression 0.6090 0.4733
Show Code
# --- Best model statement ---
best_model_region    <- model_comparison_region$Model[1]
best_accuracy_region <- round(model_comparison_region$Accuracy[1] * 100, 2)
best_kappa_region    <- model_comparison_region$Kappa[1]

cat("Best Performing Model:", best_model_region, "\n")
Best Performing Model: Decision Tree 
Show Code
cat("Test Set Accuracy:    ", best_accuracy_region, "%\n")
Test Set Accuracy:     64.41 %
Show Code
cat("Kappa Statistic:      ", best_kappa_region, "\n")
Kappa Statistic:       0.5175 

7.6 Actual vs Predicted Table: Decision Tree

Show Code
# --- Actual vs Predicted comparison ---
region_results <- tibble(
  Actual    = test_y_region,
  Predicted = pred_dt_region,
  Correct   = test_y_region == pred_dt_region
)

# --- Actual vs Predicted summary table ---
region_summary <- region_results %>%
  group_by(Actual, Predicted) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(
    Result = ifelse(Actual == Predicted,
                    "Correct", "Incorrect")
  ) %>%
  arrange(Actual, Predicted)

# --- Display actual vs predicted table ---
region_summary %>%
  kable(
    col.names = c("Actual", "Predicted",
                  "Count", "Result"),
    caption   = "Table: Actual vs Predicted: Region (Decision Tree)"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff") %>%
  column_spec(4, bold = TRUE,
              color = ifelse(region_summary$Result == "Correct",
                             "#00ff9d", "#ff6b6b"))
Table: Actual vs Predicted: Region (Decision Tree)
Actual Predicted Count Result
North North 1739 Correct
North East 652 Incorrect
North West 808 Incorrect
South South 3995 Correct
South West 3 Incorrect
East North 1426 Incorrect
East South 152 Incorrect
East East 724 Correct
East West 1298 Incorrect
West North 1081 Incorrect
West South 272 Incorrect
West West 3844 Correct
Show Code
# --- Overall prediction breakdown ---
cat("\nRegion Prediction Breakdown:\n")

Region Prediction Breakdown:
Show Code
cat("  Correctly classified:  ", sum(region_results$Correct), "\n")
  Correctly classified:   10302 
Show Code
cat("  Incorrectly classified:", sum(!region_results$Correct), "\n")
  Incorrectly classified: 5692 
Show Code
cat("  Overall Accuracy:      ",
    round(mean(region_results$Correct) * 100, 2), "%\n")
  Overall Accuracy:       64.41 %
Show Code
# --- Per region accuracy ---
cat("\nPer Region Accuracy:\n")

Per Region Accuracy:
Show Code
region_results %>%
  group_by(Actual) %>%
  summarise(
    Total    = n(),
    Correct  = sum(Correct),
    Accuracy = round(Correct / Total * 100, 2)  # use the counts; mean() of the summarised scalar is wrong
  ) %>%
  print()
# A tibble: 4 × 4
  Actual Total Correct Accuracy
  <fct>  <int>   <int>    <dbl>
1 North   3199    1739     54.4
2 South   3998    3995     99.9
3 East    3600     724     20.1
4 West    5197    3844     74.0

7.7 Save Region Best Model

Show Code
# --- Save best model for region prediction ---
saveRDS(model_dt_region, "best_model_region.rds")

cat("Best model for region prediction saved successfully.\n")
Best model for region prediction saved successfully.
Show Code
cat("Model: Decision Tree\n")
Model: Decision Tree
Show Code
cat("File:  best_model_region.rds\n")
File:  best_model_region.rds

7.8 Interpretation: Region Prediction

The Decision Tree outperformed both Random Forest and Multinomial Logistic Regression, achieving a test set accuracy of 64.41% and a Kappa of 0.5175, indicating moderate agreement beyond chance for a four-class prediction problem.

The per region accuracy reveals a striking performance imbalance:

  • South: 99.92% accuracy, virtually perfect. The South’s distinct customer profile (older age, a low spending ceiling of $499.94) makes it uniquely identifiable among the regions.
  • West: 73.96% accuracy, good performance, driven by the West’s young, high-spending customer profile.
  • North: 54.36% accuracy, moderate. North customers overlap demographically with both East and West, making them harder to distinguish.
  • East: 20.11% accuracy, poor. The majority of East customers are misclassified as North (1,426 cases), suggesting the two regions have very similar demographic and spending profiles that the model cannot reliably separate.

Overall, 10,302 out of 15,994 customers were correctly classified, with 5,692 misclassifications, concentrated almost entirely in the East region.

Recommendation: The model is reliable for identifying South and West customers, but struggles significantly with East region customers. Blackwell should treat East region predictions with caution. Richer data such as product category preferences or browsing behaviour would likely improve East vs North discrimination considerably.
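The headline figures above (64.41% accuracy, Kappa = 0.5175) can be reproduced directly from the confusion counts in the actual-vs-predicted table. A minimal sketch in base R, computing Cohen's Kappa by hand rather than via caret's `confusionMatrix()`:

```r
# Rebuild the confusion matrix from the counts in the table above
# (rows = actual region, columns = predicted region; absent
# combinations are zero).
regions <- c("North", "South", "East", "West")
cm <- matrix(
  c(1739,    0,  652,  808,
       0, 3995,    0,    3,
    1426,  152,  724, 1298,
    1081,  272,    0, 3844),
  nrow = 4, byrow = TRUE,
  dimnames = list(Actual = regions, Predicted = regions)
)

n   <- sum(cm)                                  # 15,994 test cases
p_o <- sum(diag(cm)) / n                        # observed accuracy
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2     # chance agreement
kappa <- (p_o - p_e) / (1 - p_e)                # Cohen's Kappa

round(p_o * 100, 2)   # 64.41
round(kappa, 4)       # 0.5175
```

Reconciling the reported metrics against the raw counts like this is a useful sanity check that the confusion table and the caret summary describe the same test set.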

8 Conclusions & Recommendations

This section brings together the key findings from both the Exploratory Data Analysis and the Predictive Modelling phases into a concise, actionable summary for Danielle Sherman and Martin Goodrich. Every conclusion drawn here is grounded directly in the data: no assumptions, no generalisations.

Show Code
# --- Final findings summary table ---
conclusions <- tibble(
  Business_Question = c(
    "Do customers in different regions spend more per transaction?",
    "Is there a relationship between items purchased and amount spent?",
    "Do older customers spend more than younger customers?",
    "Are in-store customers older than online customers?",
    "Are there age differences across regions?",
    "Can purchase channel be predicted?",
    "Can customer age be predicted?",
    "Can customer region be predicted?"
  ),
  Finding = c(
    "Yes, significant regional differences exist. West leads ($1,283.94), South trails ($252.10) with a spending ceiling of $499.94.",
    "No, Pearson r = 0.0004. No meaningful relationship exists between items purchased and amount spent.",
    "No, the data contradicts this hypothesis. Younger customers spend more. Pearson r = -0.282.",
    "No, the data contradicts this hypothesis. Online customers are older (mean 48.56) than in-store (mean 42.96).",
    "Yes, statistically significant age differences exist across regions (F = 6,139.42, p < 0.0001). South is oldest (56.6), West is youngest (38.8).",
    "Yes — Decision Tree achieves 88.82% accuracy. Online customers identified with near-perfect precision.",
    "Partially. Decision Tree RMSE = 13.89 years, R² = 0.22. The model explains only 21.82% of age variance; not reliable for operational use.",
    "Partially. Decision Tree achieves 64.41% accuracy. South (99.92%) and West (73.96%) predicted well. East poorly predicted (20.11%)."
  ),
  Recommendation = c(
    "Allocate higher marketing budgets to West and East. Deploy value-oriented strategy for South. Investigate absence of high-value purchases in South.",
    "Bundling and volume-based pricing strategies are unlikely to drive higher revenue. Focus on product type and price point instead.",
    "Marketing campaigns targeting older customers as higher spenders are not justified. Focus high-value campaigns on younger demographics (18-39).",
    "The website already attracts older customers. Focus should shift to understanding why younger in-store customers generate higher transaction values.",
    "Regional marketing should account for demographic differences. South needs trust-building, value-oriented messaging. West and East suit premium campaigns.",
    "Deploy the Decision Tree model to proactively segment customers by channel for targeted marketing communications.",
    "Transaction data alone is insufficient to predict age reliably. Richer data, browsing behaviour, product categories are needed.",
    "Model is reliable for South and West segmentation. East region predictions should be treated with caution until richer data is available."
  )
)

# --- Display conclusions table ---
conclusions %>%
  kable(
    col.names = c("Business Question", 
                  "Finding", 
                  "Recommendation"),
    caption   = "Table: Summary of Findings & Recommendations"
  ) %>%
  kable_styling(full_width = TRUE) %>%
  column_spec(1, bold = TRUE, color = "#00d4ff", width = "25%") %>%
  column_spec(2, width = "40%") %>%
  column_spec(3, width = "35%")
Table: Summary of Findings & Recommendations
Business Question Finding Recommendation
Do customers in different regions spend more per transaction? Yes, significant regional differences exist. West leads ($1,283.94), South trails ($252.10) with a spending ceiling of $499.94. Allocate higher marketing budgets to West and East. Deploy value-oriented strategy for South. Investigate absence of high-value purchases in South.
Is there a relationship between items purchased and amount spent? No, Pearson r = 0.0004. No meaningful relationship exists between items purchased and amount spent. Bundling and volume-based pricing strategies are unlikely to drive higher revenue. Focus on product type and price point instead.
Do older customers spend more than younger customers? No, the data contradicts this hypothesis. Younger customers spend more. Pearson r = -0.282. Marketing campaigns targeting older customers as higher spenders are not justified. Focus high-value campaigns on younger demographics (18-39).
Are in-store customers older than online customers? No, the data contradicts this hypothesis. Online customers are older (mean 48.56) than in-store (mean 42.96). The website already attracts older customers. Focus should shift to understanding why younger in-store customers generate higher transaction values.
Are there age differences across regions? Yes, statistically significant age differences exist across regions (F = 6,139.42, p < 0.0001). South is oldest (56.6), West is youngest (38.8). Regional marketing should account for demographic differences. South needs trust-building, value-oriented messaging. West and East suit premium campaigns.
Can purchase channel be predicted? Yes — Decision Tree achieves 88.82% accuracy. Online customers identified with near-perfect precision. Deploy the Decision Tree model to proactively segment customers by channel for targeted marketing communications.
Can customer age be predicted? Partially. Decision Tree RMSE = 13.89 years, R² = 0.22. The model explains only 21.82% of age variance; not reliable for operational use. Transaction data alone is insufficient to predict age reliably. Richer data, browsing behaviour, product categories are needed.
Can customer region be predicted? Partially. Decision Tree achieves 64.41% accuracy. South (99.92%) and West (73.96%) predicted well. East poorly predicted (20.11%). Model is reliable for South and West segmentation. East region predictions should be treated with caution until richer data is available.

8.1 Closing Remarks

This analysis has delivered data-driven answers to every business question posed by Danielle Sherman and tested both hypotheses raised by Martin Goodrich. The findings are clear:

Martin’s hypotheses are not supported by the data. Older customers do not spend more, and in-store customers are not older than online customers. Any business decisions based on these assumptions would be misdirected.

Regional differences are the most actionable insight. The West and East regions represent Blackwell’s highest-value customer segments, while the South presents a fundamentally different and more constrained spending profile that warrants its own dedicated strategy.

The Decision Tree model is the strongest performer across all three prediction tasks, achieving 88.82% accuracy for channel prediction and providing reliable region segmentation for South and West customers.

The most important limitation identified in this analysis is data richness. Transaction data alone (five variables across 80,000 records) has clear predictive limits, particularly for age prediction. Blackwell’s next step should be enriching its customer data collection to include browsing behaviour, product category preferences, and loyalty programme engagement. This would unlock significantly more powerful predictive models and deeper customer insights.

Analysis conducted by Freda Erinmwingbovo, February 2026